The investigation focused on the effects of using
grouped data to estimate the relations that exist in data on
individuals. Different research contexts were identified in which
researchers group observations though interested in relations among
measurements on individuals. The consequences of estimating
regression coefficients from grouped data were examined from a
"structural equations" perspective. A simple linear regression model
was hypothesized and then modified by the incorporation of a
grouping variable. A taxonomy was generated from the modified model
so that every possible grouping variable fit into one of four
categories defined by the relations of the grouping criterion to
other variables in the system. Each category was then examined for
bias and efficiency of estimation. General principles were determined
for choosing a grouping method which minimizes loss of information.
The complications that arose in the multiple regression case were
also delineated. (Author/SH)
ISSUES CONCERNING INFERENCES
FROM GROUPED OBSERVATIONS*

Leigh Burstein
Department of Educational Psychology
University of Wisconsin-Milwaukee

1. INTRODUCTION

This presentation focuses on the effects of using data from groups
of individuals to estimate relations that exist in data on individuals.
Such discussions occur in the research literature under the names "data
aggregation", the "grouping of observations", or simply "grouping", which
all refer to the replacement of numbers representing observations on
individuals with a smaller set of numbers representing observations
aggregated (in the present context, averaged) over groups of individuals.
For example, an investigator may group observations by classroom and
analyze between-class relations.

The study of grouped data introduces no special obstacles when
inferences are restricted to the level at which the data are collected
and analyzed. Complications can arise, however, when investigators turn
to data on groups of individuals to estimate regression and correlation
coefficients at the individual level. An attempt to estimate the
relation between student achievement and student aspiration level from
class means for achievement and aspiration can result in seriously
misleading estimates and faulty inference. The types of problems
considered in this paper are called "change in the units of analysis"
problems (Blalock, 1974) where, as in the example above, inferences about
the relations among individuals are desired but the data are analyzable
at the group level only.

* Paper presented at the Annual Meeting of the American Educational
Research Association, April 19, 1974, Chicago, Illinois.

The objectives of this discussion are: 

(1) to identify research contexts in which investigators
estimate regression and correlation parameters from
grouped observations though interested in relations
among measurements on individuals;

and

(2) to clarify the conditions under which estimates of
regression and correlation coefficients obtained from
grouped data are consistent with estimates that would
be obtained from ungrouped data.

First, the different research contexts in which grouping can arise
are discussed and earlier investigations of each context which considered
data aggregation methods are cited. A framework is offered to clarify
certain similarities among the different research contexts and thus
simplify the process of identifying whether a particular grouping
strategy is applicable for a given context.

Next, the existing results from three different approaches
("clustering", "optimal grouping", and "structural equations") for examining
the effects of grouping observations are summarized. This discussion
focuses on the parallels and contradictions among these different lines
of inquiry. Of the approaches contrasted, the "structural equations"
approach appears to be most promising and will receive the most attention
throughout the paper.

In Section 4 a more general approach which subsumes all others is
presented. This approach is an extension of the "structural equations"
approach originally articulated by Blalock (1964) and Hannan (1970,
1971, 1972). We concentrate on the simple linear regression model and
describe systematic procedures for examining the consequences of
different methods of grouping in this two-variable case.

The general strategy described in Section 4 is to modify the
structural regression model at the individual level by incorporating the
grouping characteristic directly into the model. This modified causal
structure leads logically to a taxonomy whereby every possible grouping
characteristic fits into one of several mutually exclusive categories
defined by the relations of the characteristics to the variables in the
original regression model. The different characteristics within a given
category then have similar implications for the precision of estimating
individual-level relations from grouped observations.

In Section 5 data from a study of incoming freshmen at a large
midwestern university are used to illustrate the procedures developed
here. The results for both regression and correlation coefficients
are found to conform with the predictions from the extended "structural
equations" approach. Additionally, a compositing procedure is described
which generates a more stable estimate of the individual-level regression
coefficient from the separate estimates from data grouped by several of
the best grouping variables.

In the concluding section, the suggestions for improving inferences
from grouped data are summarized. In addition, promising strategies
for treating unordered grouping characteristics such as classroom are
suggested. Complications that arise in extending these results to the
multivariate case are also delineated. The prediction of effects under
nominal grouping conditions and the further elaboration of the consequences
of aggregation within a multivariate causal structure represent the kind
of aggregation problems where more research and attention will be needed
before investigators can analyze all kinds of change-in-units problems
with confidence.

2. Data Aggregation in Different Research Contexts

The analysis of grouped data is becoming increasingly common in
educational research as investigators contemplate massive census-like
data in addition to school and classroom aggregate measures. Carefully
chosen grouping methods can also be applied when the question of
confidentiality of data arises, when data are missing from some
individuals in a study, and when the variables in the study are fallibly
measured.

The degree of investigator control over the aggregation of data is
an important consideration for each kind of "change-in-the-units-of-
analysis" problem. In certain contexts group membership is determined
in some natural way by, e.g., school attended or state of residence, and
is thus beyond the investigator's control except for exclusion of certain
sampling units and individuals (limited or no investigator control).
In other contexts the investigator can manipulate the formation of groups
at least in part (complete or partial control). Obviously there are more
options in the latter contexts for improving the precision of estimates
from grouped observations.

Table 1 presents a detailed account of the different research
contexts and the investigator's options for controlling the formation of
groups. Explanations regarding why data aggregation methods are needed,
how such methods are applied, where they are principally applied, and
who previously conducted research related to each context are also
included.



Table 1 



Contexts Allowing Complete Investigator Control Over Grouping Membership.
Despite the seemingly extensive lists of references, aggregation
procedures have been applied infrequently in Contexts (A) and (B), at least
in recent years. Perhaps further simplification and clarification
of the grouping methods may be necessary to extend their use in these
contexts. However, a more likely explanation for their limited use is
that more powerful statistical methods have been proposed (see Afifi
and Elashoff (1966, 1967) on the missing information problem and Madansky
(1959), Blalock et al. (1970), Blalock (1971), and Wiley and Wiley (1971)
on the measurement error problem). It is still an open question whether
these other methods will warrant more consideration once further
refinements in grouping methods have been made.

Econometricians have already developed and demonstrated sound
principles for data aggregation where the size and economy of analysis
(Context (C)) is the chief concern (Prais and Aitchinson (1954), Theil
(1954), Cramer (1964), Green (1964)). The other social and behavioral
sciences are just beginning to tap the tremendous wealth of methodological
advances made by the econometricians. The methodology for handling data
aggregation problems is no exception to the slow pace at which educational,
psychological, and sociological researchers are incorporating the
econometricians' "power tools" into their theory-building enterprises.

In the attempt to expand the conceptual theory for handling
change-in-units problems, this investigator incorporates the econometric
results that simplify the present efforts and builds on their framework
where the special problems of dealing with individuals, rather than
institutions or commodities, dictate modifications. As will be shown,
however, the work of Prais and Aitchinson (1954), and later of Cramer
(1964), in Context (A) is an essential part of any adequate
conceptualization of the problems of data aggregation discussed in this
paper.

Partial Investigator Control over Group Membership: Confidential Data
in Social Research. The use of partial aggregation techniques for
analyzing confidentially collected information is a relatively new notion.
Feige and Watts (1970) apparently were the first to recognize the
utility of grouping methods in this context. The procedure itself is quite
simple. The investigator collects information on potentially suitable
grouping characteristics in addition to those of primary interest. The
individual observations can then be collapsed into different groups and
the parameters of interest can be estimated from the between-group
relations. This procedure is viable as long as the grouping
characteristics are measured simultaneously with each primary variable
(regardless of whether the information on the primary variable is collected
anonymously or with the individual subjects identified), and satisfy
certain conditions necessary for precise estimation in change-in-units
analyses.

Conducting research on confidential data presents very complicated
social and political problems. Boruch (1971(a), 1971(b), 1972(a),
1972(b)) brings into focus the ethical and legal considerations associated
with research under confidentiality constraints besides suggesting
alternatives to partial aggregation methods. Although the need for the
privacy and protection of subjects in social research is recognized,
this presentation does not deal directly with such complications.
The procedures suggested in this investigation offer individuals
assurances of their anonymity while maintaining the possibility of carrying
out research on topics that can further understanding of the complex
interactions among individuals and institutions within our society.
The premise is that a person can maintain his or her individual identity
and still cooperate with efforts to clarify the cornerstones of social
processes through analysis methods designed to allow examination of
relations among human characteristics without directly identifying the
participating individuals.

Limited or No Investigator Control: "Ecological Inference". The topic
of ecological inference has received a lot of attention in the sociological
literature. There was an extended exchange of ideas on the subject among
sociologists and social statisticians (Robinson, 1950; Duncan and Davis,
1953; Goodman, 1953; Goodman, 1959) in the 1950's. Though the debate
centered around methods for overcoming the "ecological fallacy", there
were many who just wondered what all the fuss was about. After all,
group and inter-group relations occupy a prominent position in sociology
and thus group-level analyses should be acceptable.

The educational and psychological literature hardly reflects an
awareness of the "ecological fallacy" of inferring correlations between
properties of individuals from the ecological correlations derived from
group data. Important reports (Coleman et al., 1966) and papers in
respected journals of educational and psychological research (Goldberg,
1969; Rock et al., 1970; Baird and Feister, 1972) perform between-group
analyses without directly considering whether the relations estimated
at the group level are applicable at the individual level.

The correct response to the query regarding appropriate level of
inference is not obvious. However, the dramatic change cited by
Robinson (1950) in size of the correlation between illiteracy and race as
a function of the coarseness of the units of analysis (.95 at the
regional level, .77 at the state level, .20 at the individual level) should
warn researchers not to take the query lightly. The investigator
definitely needs to understand the rules governing his grouping process in
any empirical study in the social or behavioral sciences (Lewy, 1972).

In any case, aggregate sampling units present a particularly
complex type of aggregation problem since questions regarding sampling bias
arise in addition to concerns about level of inference. One question
may be whether the sampled schools are representative of the schools in
the universe to which one wants to generalize. The investigator must
clearly understand the basis for his inferences to the individual level
in order to be at all confident about his estimates. Otherwise, it may
be best to make inferences at the group level or to examine the
individuals within groups (Wiley, 1970).

Applicability of the Taxonomic Approach to the Different Research Contexts.
The strategies developed here are suitable mainly for Contexts (A) through
(D). However, they do have implications as to how one can proceed when
the grouping characteristic has a nominal scale, which most often happens
in Context (E). To take full advantage in Context (E) of the procedures
described below, the investigator must first determine how to express the
relations of the grouping characteristic to primary variables in an ordinal
fashion. How this can be done is a topic requiring more research, but once
it is done, the strategies are applicable.

3. The Basic Model and Accompanying Review of the Literature

A basic summary of the different approaches to change-in-units
analysis is provided in Hannan and Burstein (1974). A more complete
accounting of the problems and strategies of grouping for individual-level
inferences can be found in Burstein (1974). In the present paper, the
effects of grouping on the estimation of the relation between two variables
within a simple linear regression framework are examined. The reason for
the restriction to the two-variable case is that the different approaches
achieve their purest and simplest forms when only two variables are
considered. Forecasting results for higher-order relations is possible,
but the strategies cannot be as clearly delineated (see Section 6).

The regression model is examined because it results in a formulation
suitable for estimating both regression and correlation coefficients.
If the preferred "structural equations" approach is followed, the
investigator need only conduct his analysis on the individual observations
in standardized form if he wishes to estimate correlations. Once the
individual observations are standardized, the comparison of the regression
coefficients before and after data aggregation becomes essentially a
comparison of zero-order correlation coefficients at the individual and
group levels. This slight twist of procedure enables the investigator
to apply the same basic model to both regression and correlation
coefficients.

The analysis that follows might best be viewed in the context of
a substantive problem. Assume that an investigator wished to study the
relation of achievement (X) to students' ratings of their intellectual
abilities (Y). To be specific, he wants to estimate the linear
regression coefficient $\beta_{YX}$ from the simple linear model

(1)    $Y = \alpha + \beta_{YX} X + u$ ,

where $\alpha$ is the intercept, $\beta_{YX}$ is the standardized regression
coefficient from the regression of Y on X, and u represents the lack of fit
of a linear model and the effects of other variables on Y, independent of X.
To estimate $\beta_{YX}$ the investigator normally collects paired measurements
$(X_p, Y_p)$ from a sample of N students (p = 1, ..., N) and then uses
ordinary least-squares (OLS) procedures to estimate $\beta_{YX}$ from the equation

(2)    $Y_p = \alpha + \beta_{YX} X_p + u_p$

under the assumptions that

1) $E(u_p) = 0$, for all p.

2) $E(u_p u_{p'}) = \sigma_u^2$ if $p = p'$, constant for all p;
   $E(u_p u_{p'}) = 0$, otherwise.

3) $E(X_p u_p) = 0$, for all p.

The OLS estimator $b_{YX}$ of $\beta_{YX}$ for this model is known to be

(3)    $b_{YX} = C(X,Y) / V(X)$ ,

where C(X,Y) and V(X) are the sample covariance of X and Y and the
sample variance of X, respectively. (From here on, subscripts are dropped
where the interpretation of the values will remain unambiguous.)

Under the assumptions for (2),

$E(b_{YX}) = \beta_{YX}$

with estimated mean squared error (MSE) equal to the variance of $b_{YX}$;
i.e.,

(4)    $V(b_{YX}) = \sigma_u^2 / SS(X)$ ,

where SS(X) is the sum of squares for X.
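
As an illustration only, the estimator in (3) and its estimated variance in (4) can be computed directly from paired scores. The Python sketch below is not part of the original analysis: the function name `ols_slope`, the toy scores, and the use of the residual mean square (with N - 2 degrees of freedom) as the estimate of the disturbance variance are all assumptions made for the example.

```python
def ols_slope(x, y):
    """Return (b, var_b): the slope C(X,Y)/V(X) and its estimated variance.

    var_b uses the residual mean square as the estimate of sigma_u^2,
    divided by SS(X), the sum of squared deviations of X.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = ss_xy / ss_x                      # same ratio as C(X,Y)/V(X)
    a = mean_y - b * mean_x               # intercept
    resid_ss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    s2_u = resid_ss / (n - 2)             # estimate of sigma_u^2 (assumed divisor)
    return b, s2_u / ss_x


# Hypothetical achievement (x) and self-rating (y) scores for five students.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b, var_b = ols_slope(x, y)
print(b, var_b)
```

Because the covariance and the variance share the same divisor, the ratio of sums of cross-products to sums of squares gives the same slope as C(X,Y)/V(X).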

The next step is to estimate $\beta_{YX}$ from observations grouped on some
characteristic. Earlier researchers have approached the problem of
grouped estimation in several ways. Three relatively distinct
perspectives are discussed below.

The Clustering Perspective. The earliest treatment of aggregation
problems in the social sciences developed from concerns over the inflation
of correlation coefficients as individual observations are grouped
together. This effect was noticed in a wide variety of investigations.
Robinson's (1950) data on the ecological correlation between race and
illiteracy were cited above. Prior to him, Gehlke and Biehl (1934)
showed the inflationary effects of grouping for data on rental values
and delinquency rates, Thorndike (1939) cited the same fallacy in the
psychological literature over the correlation between family size and
delinquency, and Yule and Kendall examined correlations among
consolidated regional crop yields.

Each investigator tried to uncover the mechanism responsible for
what he perceived to be the grouping artifact. Most arrived at
essentially the same conclusions from their different algebraic
formulations. Since Robinson's work on ecological correlation has received
much attention (see Alker (1969); Hannan (1970, 1971)), it will be
summarized here, with some modification in notation, as an example of the
clustering perspective.

Robinson employs "covariance theorems" to decompose the sums of
squares and sums of cross-products into their within-group and
between-group (ecological) components. Given a sample of size N comprised
of groups of equal size n,

$SS(XY) = WSS(XY) + SS(\bar{X}\bar{Y})$ ,

where $WSS(XY)$ and $SS(\bar{X}\bar{Y})$ are the within-groups and
between-groups sums of cross-products, respectively. Similarly,

$SS(X) = WSS(X) + SS(\bar{X})$ ,

where $WSS(X)$ and $SS(\bar{X})$ are the within-groups and between-groups
sums of squares for X.

The above equations are substituted into the formula for the
correlation coefficient $r_{XY}$ from ungrouped observations, and the terms
are rearranged to yield

(5)    $r_{XY} = Wr_{XY} \sqrt{1 - \eta_X^2} \sqrt{1 - \eta_Y^2} + r_{\bar{X}\bar{Y}} \, \eta_X \eta_Y$ ,

where $Wr_{XY}$ is the within-group correlation, $r_{\bar{X}\bar{Y}}$ is the
ecological (between-group) correlation, and $\eta_X$ and $\eta_Y$
are the familiar correlation ratios for X and Y.

The relationship between $r_{\bar{X}\bar{Y}}$ and $r_{XY}$ is complex but
describable. In his interpretation of Equation (5), Robinson identified
several typical effects of consolidating units:

(i) The ecological correlation decreases as groups become more
heterogeneous since the within-group correlation increases
directly with increasing coarseness and the between-group
proportion of the variation equals $1 - Wr_{XY}$.

(ii) The correlation ratios $\eta_X$ and $\eta_Y$ decrease as the
between-groups variation becomes smaller.

(iii) Of the first two effects, the changes in the correlation
ratios are considerably more important than the changes in
$Wr_{XY}$ so that the numerical value of the ecological
correlation increases with increasing consolidation of units.

Thus, according to the clustering approach, grouping always inflates
coefficients. But, as Hannan and Burstein (1974) have pointed out, the
clustering approach fails to explicate the nature of the grouping process
and thereby misses certain key distinctions among ways of consolidating
units. This is especially unfortunate since the clustering approach
is mainly applicable to Context (E) where group membership is "naturally
determined", and the discovery of the grouping mechanisms is a
complicated but necessary endeavor.
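
The decomposition behind Robinson's argument can be checked numerically. The Python sketch below is an illustration under stated assumptions, not a reproduction of Robinson's computations: the function name `robinson_components` and the nine toy observations are hypothetical, and the correlation ratio is taken to be the square root of the between-group share of the sum of squares. It splits the sums of squares and cross-products into within-group and between-group parts and confirms that the total correlation equals the within-group correlation weighted by $\sqrt{(1-\eta_X^2)(1-\eta_Y^2)}$ plus the between-group correlation weighted by $\eta_X \eta_Y$.

```python
import math


def robinson_components(groups):
    """Return (r_total, r_within, r_between, eta_x, eta_y) for grouped (x, y) pairs."""
    pairs = [p for g in groups for p in g]
    n = len(pairs)
    grand_x = sum(x for x, _ in pairs) / n
    grand_y = sum(y for _, y in pairs) / n
    ss_x = sum((x - grand_x) ** 2 for x, _ in pairs)
    ss_y = sum((y - grand_y) ** 2 for _, y in pairs)
    ss_xy = sum((x - grand_x) * (y - grand_y) for x, y in pairs)
    bss_x = bss_y = bss_xy = 0.0           # between-group components
    for g in groups:
        mean_x = sum(x for x, _ in g) / len(g)
        mean_y = sum(y for _, y in g) / len(g)
        bss_x += len(g) * (mean_x - grand_x) ** 2
        bss_y += len(g) * (mean_y - grand_y) ** 2
        bss_xy += len(g) * (mean_x - grand_x) * (mean_y - grand_y)
    wss_x, wss_y, wss_xy = ss_x - bss_x, ss_y - bss_y, ss_xy - bss_xy
    r_total = ss_xy / math.sqrt(ss_x * ss_y)
    r_within = wss_xy / math.sqrt(wss_x * wss_y)
    r_between = bss_xy / math.sqrt(bss_x * bss_y)
    eta_x = math.sqrt(bss_x / ss_x)        # correlation ratio for X (assumed definition)
    eta_y = math.sqrt(bss_y / ss_y)        # correlation ratio for Y
    return r_total, r_within, r_between, eta_x, eta_y


# Three hypothetical groups of three (x, y) observations each.
groups = [
    [(1, 2), (2, 1), (3, 3)],
    [(4, 5), (5, 4), (6, 7)],
    [(7, 9), (8, 7), (9, 8)],
]
r, r_w, r_b, eta_x, eta_y = robinson_components(groups)
rebuilt = r_w * math.sqrt((1 - eta_x**2) * (1 - eta_y**2)) + r_b * eta_x * eta_y
print(r, rebuilt, r_b)
```

In this toy data set the between-group correlation exceeds the total correlation, which is the inflation the clustering perspective describes; the identity itself holds for any grouping with nonzero within-group and between-group variation.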

The "Optimal Grouping" Approach. Optimal grouping proponents were
motivated by the need to reduce their over-abundant data (Context (A)) by a
grouping strategy which optimized the retention of the ungrouped
information. Prais and Aitchinson (1954) and Cramer (1964) are largely
responsible for the basic work in this area. Cramer's formulation of the
single regressor case is summarized below.

Cramer started with Equation (2) above for his model at the
individual level with one important change. He relabeled the individual
observations by letting $X_{ij}$ (replacing $X_p$ in (2)) represent the
achievement score of the jth student in the ith group and $Y_{ij}$
(replacing $Y_p$) represent the student's corresponding academic
self-rating. This was done to reflect the underlying, as yet unspecified,
group membership. Cramer subsequently arrived at Equations (3) and (4)
above for his individual-level $b_{YX}$ and $V(b_{YX})$.

Next, the group means $(\bar{X}_{i.}, \bar{Y}_{i.})$ are found by averaging
over $X_{ij}$ and $Y_{ij}$ within groups, and these "grouped observations"
become the units of analysis. For the substantive example, this is
equivalent to grouping by, say, classroom and using class mean achievement
and class mean self-rating in the analysis.

The equation at the group level is then

(6)    $\bar{Y}_{i.} = \alpha + \beta_{YX} \bar{X}_{i.} + \bar{u}_{i.}$ .

Under the assumptions for (2), the regression coefficient in the
population for (6) has the same value as in (2).

The weighted least-squares estimator $B_{\bar{Y}\bar{X}}$ of $\beta_{YX}$ from (6) is

(7)    $B_{\bar{Y}\bar{X}} = C(\bar{X},\bar{Y}) / V(\bar{X})$ .

Under the assumptions above,

$E(B_{\bar{Y}\bar{X}}) = \beta_{YX}$

and

(8)    $V(B_{\bar{Y}\bar{X}}) = \sigma_u^2 / SS(\bar{X})$ .

Thus, according to Cramer, and to Prais and Aitchinson, the grouped
estimator $B_{\bar{Y}\bar{X}}$ is an unbiased estimate of $\beta_{YX}$ with
relative efficiency

(9)    $E(b/B) = SS(\bar{X}) / SS(X) = \eta_X^2$ ,

the square of the familiar correlation ratio, which has a value less
than 1.

However, Cramer indicates that the estimate of the correlation
coefficient $\rho_{XY}$ between X and Y is inflated if the groups are not
formed randomly.
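
One way to read the efficiency result is that a grouping rule retains information to the extent that it preserves between-group variation in X. The following Python sketch illustrates this by simulation; it is not Cramer's own computation. The function `grouped_slope`, the simulated data (slope 0.5, unit-variance disturbances), the seed, and the choice of 20 equal-size groups are all assumptions made for the example.

```python
import random


def grouped_slope(x, y, key, n_groups):
    """Sort cases by `key`, cut them into equal-size groups, and regress
    the group means of Y on the group means of X.

    Returns (B, eta2): the grouped slope and SS(X-bar)/SS(X).
    """
    order = sorted(range(len(x)), key=lambda i: key[i])
    size = len(x) // n_groups
    grand_x = sum(x) / len(x)
    grand_y = sum(y) / len(y)
    ss_x = sum((xi - grand_x) ** 2 for xi in x)
    bss_x = bss_xy = 0.0
    for g in range(n_groups):
        idx = order[g * size:(g + 1) * size]
        mean_x = sum(x[i] for i in idx) / size
        mean_y = sum(y[i] for i in idx) / size
        bss_x += size * (mean_x - grand_x) ** 2
        bss_xy += size * (mean_x - grand_x) * (mean_y - grand_y)
    return bss_xy / bss_x, bss_x / ss_x


rng = random.Random(0)                       # fixed seed for repeatability
beta = 0.5                                   # hypothetical individual-level slope
x = [rng.gauss(0, 1) for _ in range(2000)]
y = [beta * xi + rng.gauss(0, 1) for xi in x]

# Grouping on X itself keeps nearly all of the variation in X.
B_on_x, eff_on_x = grouped_slope(x, y, key=x, n_groups=20)

# Grouping on an unrelated random label keeps very little of it.
labels = [rng.random() for _ in x]
B_random, eff_random = grouped_slope(x, y, key=labels, n_groups=20)

print(B_on_x, eff_on_x)
print(B_random, eff_random)
```

Both grouped slopes are unbiased in this setup, but the randomly grouped one is far less stable because its efficiency ratio is close to zero.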

The "Structural Equations" Approach. Blalock (1964) considered the
problems of the grouping of observations from a causal perspective. He
started with the hypothesis that systematic grouping can lead to
differential effects on the regression estimates of causal relations.
Blalock argues convincingly (later amplified by Hannan (1972)) that if the
investigator, for reasons beyond his control, groups on the dependent
variable Y, or by a variable highly related to Y, other than X,
$B_{\bar{Y}\bar{X}}$ will be a biased estimate of $\beta_{YX}$. He cites the
facts that (1) the value $r_{XY}^2 = b_{YX} b_{XY}$ is inflated by any
systematic grouping procedure (that is, $r_{\bar{X}\bar{Y}}^2 > r_{XY}^2$), and
(2) grouping on the independent variable X (or by a variable highly
related to X) does not produce bias (that is, $B_{\bar{Y}\bar{X}} = b_{YX}$ for
grouping on X). Taking (1) and (2) together implies that
$B_{\bar{X}\bar{Y}} > b_{XY}$. Thus grouping on X inflates the regression
coefficient when X is the dependent variable. Similarly, grouping on Y
inflates the estimate of $\beta_{YX}$.

Blalock also describes the phenomenon in another way. Grouping on
Y results in a proportional reduction in the variation of Y and the
covariation of X and Y. But the variation in X exhibits a greater
proportional reduction unless X and Y are extremely highly related. Since
$V(\bar{X})$, and not $V(\bar{Y})$, is the denominator of the sample estimator
$B_{\bar{Y}\bar{X}}$, bias will result from this type of grouping.

Hannan (1972) uses a different argument to arrive at the same
conclusion. He starts with the causal model

[Path diagram: X --> Y <-- $u_Y$]

where $u_Y$ represents the influence of other causes of Y. He then states
that when variation in Y is maximized by ranking observations by their
Y values and then grouping "adjacent" observations, observations that
have high X and high $u_Y$ values will be placed in the highest Y groups.
Similarly, observations with both low X and low $u_Y$ values are placed
in the groups lowest on Y (assuming $\beta_{YX}$ is positive). Thus, the other
causes of Y are confounded with X so that the covariance of $\bar{X}$ and
$\bar{u}_Y$ is no longer zero in the probability limit. Hannan calls this a
specification error introduced by grouping and calls the bias in the OLS
estimates at the group level aggregation bias.
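
The contrast between grouping on X and grouping on Y can be made concrete with a small simulation. The Python sketch below is a hedged illustration rather than Blalock's or Hannan's own argument: the helper `slope_of_group_means`, the simulated model (slope 0.5, unit-variance X and disturbance), the seed, and the 20 equal-size groups of adjacent ranked cases are all assumptions introduced here.

```python
import random


def slope_of_group_means(x, y, key, n_groups):
    """Rank cases on `key`, form equal-size groups of adjacent cases, and
    return the slope from regressing group means of Y on group means of X."""
    order = sorted(range(len(x)), key=lambda i: key[i])
    size = len(x) // n_groups
    means = []
    for g in range(n_groups):
        idx = order[g * size:(g + 1) * size]
        means.append((sum(x[i] for i in idx) / size,
                      sum(y[i] for i in idx) / size))
    mean_x = sum(m[0] for m in means) / n_groups
    mean_y = sum(m[1] for m in means) / n_groups
    num = sum((a - mean_x) * (b - mean_y) for a, b in means)
    den = sum((a - mean_x) ** 2 for a, _ in means)
    return num / den


rng = random.Random(1)                       # fixed seed for repeatability
beta_yx = 0.5                                # hypothetical individual-level slope
x = [rng.gauss(0, 1) for _ in range(2000)]
y = [beta_yx * xi + rng.gauss(0, 1) for xi in x]

B_group_on_x = slope_of_group_means(x, y, key=x, n_groups=20)
B_group_on_y = slope_of_group_means(x, y, key=y, n_groups=20)
print(B_group_on_x, B_group_on_y)
```

Under these assumptions the slope from groups formed on X stays near 0.5, while the slope from groups formed on Y is several times larger, because ranking on Y sorts cases on the disturbance as well as on X.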

4. Reconsideration of the Basic Model: Introduction of a "Grouping
Variable"

Neither Blalock nor Hannan offers formal mathematical arguments
for their findings. However, their causal thinking suggests that
the role of the grouping rule (see Theil, 1954) might best be explicitly
identified in the model, even though its presence is strictly dictated
by its use for group formation, by introducing a grouping variable
into the causal structure. In other words, the criterion by which the
individual observations are to be grouped is treated as a random
variable which may be related to other variables in the system.
Furthermore, if the grouping variable Z is related to another variable, the
causal structure specifies that Z is causally prior to that variable.
It does not matter that Z may appear to be caused by, say, X in the sense
that X would be logically or temporally prior to Z if the three-variable
model Y = f(X, Z) were under study. We visualize the grouping process as
one in which Z can "select" or "force" the observations from the
bivariate distribution of X and Y into groups. It is in this sense that
Z is always causally prior to X and Y.

Structural Equations Incorporating the Grouping Variable. Once this
direction of causation has been specified, the structural model for the
relations among X, Y, and Z is easily defined. The path diagram for the
causal structure when Z is prior to X and Y is

[Path diagram: Z --> X, Z --> Y, X --> Y, with disturbances v --> X and w --> Y]

In this diagram, v is the disturbance term representing all causes of
X that are not linearly related to Z, and w is the disturbance term
representing all causes of Y that are not related to X and Z. $\gamma_{YX}$,
$\gamma_{XZ}$, and $\gamma_{YZ}$ are the path regression coefficients.

The equations corresponding to the causal structure with Z
incorporated can be written

(10a)    $Y = \alpha + \gamma_{YX} X + \gamma_{YZ} Z + w$ ,

(10b)    $X = \kappa + \gamma_{XZ} Z + v$ .

Once again, $\gamma_{YX}$, $\gamma_{XZ}$ and $\gamma_{YZ}$ are regression
parameters, and w and v are disturbance terms. w is assumed to be
independent of X, Z and v, and v is also assumed to be independent of Z.
Both disturbance terms are homoscedastic and independent
($\sigma_{w_1}^2 = \sigma_{w_2}^2 = \sigma_w^2$ and
$\sigma_{v_1}^2 = \sigma_{v_2}^2 = \sigma_v^2$; and
$\sigma_{w_1 w_2} = \sigma_{v_1 v_2} = 0$ for any two persons).

The notation $\alpha$ is retained for the intercept term in (10a) though its
value may differ from that in earlier equations. The notation $\kappa$ will
represent the intercept term in the second equation of the structural
system; its value may also vary according to the specified model.

Equation (10b) can be substituted into (10a) to obtain a single
equation for the regression of Y on Z and v:

(11)    $Y = (\alpha + \gamma_{YX}\kappa) + (\gamma_{YX}\gamma_{XZ} + \gamma_{YZ}) Z + \gamma_{YX} v + w$ .

Equation (11) is actually a reparameterization of (1) where X has been
divided into two parts: the part predictable from the grouping
variable Z and a residual part v. Equations like (11) are generally called
the "reduced forms" for the causal structures. Equation (11) is in a
form that cannot be reduced further by substitution of other equations.
The regression of Y upon X still has the regression coefficient $\beta_{YX}$.
This coefficient can be expressed in terms of the fixed parameters of
the modified causal structure. Besides the intercepts, the fixed
parameters are the three regression coefficients ($\gamma_{YX}$, $\gamma_{YZ}$,
$\gamma_{XZ}$), and the variances of Z, v and w ($\sigma_Z^2$, $\sigma_v^2$,
$\sigma_w^2$). Substitution of the reduced form expressions into the
formula for $\beta_{YX}$ yields

(12)    $\beta_{YX} = \sigma_{XY} / \sigma_X^2 = \gamma_{YX} + \gamma_{YZ}\gamma_{XZ}\,\sigma_Z^2 / (\gamma_{XZ}^2 \sigma_Z^2 + \sigma_v^2)$ .

Regression Estimator of $\beta_{YX}$ from Individual Data. Under the
modified causal structure, a simple random sample of N (= $\sum_i n_i$)
observations is drawn from the trivariate distribution
$f(X_{ij}, Y_{ij}, Z_{ij})$. The sample regression estimator of $\beta_{YX}$
is given by

(13)    $b_{YX} = \sum_i \sum_j (X_{ij} - \bar{X}_{..})(Y_{ij} - \bar{Y}_{..}) \,/\, \sum_i \sum_j (X_{ij} - \bar{X}_{..})^2$ .

Under OLS assumptions,

(14)    $E(b_{YX}) = \beta_{YX} = \gamma_{YX} + \gamma_{YZ}\gamma_{XZ}\,(\sigma_Z^2 / \sigma_X^2)$    (from (12)).

Note that when X and Z have unit variance, (14) becomes

(15)    $E(b_{YX}) = \gamma_{YX} + \gamma_{YZ}\gamma_{XZ}$ .

This equation is simply the estimate of the net effect of X on Y along
all paths connecting the two variables. When all variables are
standardized, $b_{YX}$ from (15) is an unbiased estimate of the standardized
regression coefficient.
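
The expectation of the ungrouped slope under the modified structure can be checked by simulation. The Python sketch below is an illustration only: the parameter values, the seed, the sample size, and the helper `slope` are hypothetical choices, and the target value is computed as the direct coefficient of X plus the product of the two Z coefficients scaled by the ratio of the variance of Z to the variance of X, where the variance of X is built up from the structural equation for X.

```python
import random


def slope(x, y):
    """OLS slope of y on x: sum of cross-products over sum of squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = sum((a - mean_x) ** 2 for a in x)
    return num / den


# Hypothetical structural parameters (not taken from the paper).
g_yx, g_yz, g_xz = 0.5, 0.3, 0.8
sd_v, sd_w = 0.6, 1.0
var_z = 1.0

rng = random.Random(2)                       # fixed seed for repeatability
n = 5000
z = [rng.gauss(0, 1) for _ in range(n)]
x = [g_xz * zi + rng.gauss(0, sd_v) for zi in z]                 # X = g_xz*Z + v
y = [g_yx * xi + g_yz * zi + rng.gauss(0, sd_w)                  # Y = g_yx*X + g_yz*Z + w
     for xi, zi in zip(x, z)]

var_x = g_xz ** 2 * var_z + sd_v ** 2        # variance of X implied by its equation
expected = g_yx + g_yz * g_xz * var_z / var_x
b = slope(x, y)
print(expected, b)
```

With these values X has unit variance, so the target reduces to the sum of the direct path and the product of the two paths through Z, and the sample slope should fall close to it.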

The sample variance of $b_{YX}$ can be derived from a theorem in
Hansen-Hurwitz-Madow (1953, Vol. II, p. 65):

$V(b) = E_2 V_1(b) + V_2(E_1(b))$ ,

where $V_1(b)$ is the variance of b conditional on X and Z and $V_2$ is the
variance conditional on Z. The resulting variance formula is

(16)    [formula not legible in the source copy]

The expression for the sample variance of $b_{YX}$ for the modified
causal structure is more complicated than that for the simple model.
Equations (4) and (16) are similar in form, and when grouping does not
result in bias, they are equal.

Revised Structure for the Weighted Group Means. The structural equations
for the group means based on Z can be written as

(17a)    $\bar{Y} = \alpha + \gamma_{YX}\bar{X} + \gamma_{YZ}\bar{Z} + \bar{w}$ ,

(17b)    $\bar{X} = \kappa + \gamma_{XZ}\bar{Z} + \bar{v}$ .

These equations are the same as (10a) and (10b) except that grouped
quantities have been substituted for their ungrouped counterparts. There
are still six fixed parameters in addition to the intercepts: $\gamma_{YX}$,
$\gamma_{YZ}$, $\gamma_{XZ}$, $\sigma_{\bar{Z}}^2$, $\sigma_{\bar{v}}^2$, and
$\sigma_{\bar{w}}^2$.

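The regression coefficient for the regression of $\bar{Y}$ upon $\bar{X}$ is

(18)    $\beta_{\bar{Y}\bar{X}} = \dfrac{\sigma_{\bar{X}\bar{Y}}}{\sigma_{\bar{X}}^2} = \gamma_{YX} + \gamma_{YZ}\gamma_{XZ}\,\dfrac{\sigma_{\bar{Z}}^2}{\sigma_{\bar{X}}^2}$ .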

The grouped regression coefficient $\beta_{\bar{Y}\bar{X}}$ can no longer be tacitly
assumed equal to the ungrouped coefficient $\beta_{YX}$. $\beta_{\bar{Y}\bar{X}}$ and $\beta_{YX}$
differ in that between-group variances replace the total variances.

Regression Estimator from Grouped Data. A simple random sample of N
observations is drawn from the trivariate distribution $f(X_{ij}, Y_{ij}, Z_{ij})$.
The $X_{ij}$ and $Y_{ij}$ are then grouped on the basis of the values of $Z_{ij}$, and
each observation is replaced by the group mean corresponding to its $Z_{ij}$
value; that is, $\bar{X}_{i.}$ replaces $X_{ij}$ and $\bar{Y}_{i.}$ replaces $Y_{ij}$.

Let $B_{\bar{Y}\bar{X}}$ denote the estimator for the regression coefficient from
grouped data. Then

    $B_{\bar{Y}\bar{X}} = \dfrac{\sum_{i=1}^{g} n_i\,\bar{x}_i\,\bar{y}_i}{\sum_{i=1}^{g} n_i\,\bar{x}_i^2}$ ,

where the lower case letters denote the deviations of the group means
from the grand means of the sample.
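
The grouped estimator can be computed directly from individual records and a grouping label. The following is a minimal sketch, assuming the size-weighted form in which every observation is replaced by its group mean; the function name `grouped_slope` and the toy data are hypothetical.

```python
from collections import defaultdict


def grouped_slope(xs, ys, labels):
    """Slope of group-mean Y on group-mean X, weighted by group size.

    Each observation is replaced by the mean of its group, so a group
    with n_i members contributes n_i times to both sums.
    """
    n_total = len(xs)
    grand_x = sum(xs) / n_total
    grand_y = sum(ys) / n_total

    # Accumulate count, sum of x, and sum of y for every group label.
    sums = defaultdict(lambda: [0, 0.0, 0.0])
    for x, y, label in zip(xs, ys, labels):
        entry = sums[label]
        entry[0] += 1
        entry[1] += x
        entry[2] += y

    num = den = 0.0
    for n_i, sum_x, sum_y in sums.values():
        dx = sum_x / n_i - grand_x   # deviation of group mean from grand mean
        dy = sum_y / n_i - grand_y
        num += n_i * dx * dy
        den += n_i * dx * dx
    return num / den


# Toy data: two groups of two observations each.
print(grouped_slope([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], [0, 0, 1, 1]))
```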
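Under OLS assumptions,

(19)    $E(B_{\bar{Y}\bar{X}}) = \beta_{\bar{Y}\bar{X}} = \gamma_{YX} + \gamma_{YZ}\gamma_{XZ}\,\dfrac{\sigma_{\bar{Z}}^2}{\sigma_{\bar{X}}^2}$    (from (18)).

The only differences between equations (12) and (18), and consequently
between (14) and (19), are that the variances of the group means
of Z and X replace the variances of the ungrouped observations. The
unbiased estimators of $\beta_{YX}$ and $\beta_{\bar{Y}\bar{X}}$ are $b_{YX}$ and $B_{\bar{Y}\bar{X}}$ respectively,
but the investigator wants to estimate $\beta_{YX}$ from $B_{\bar{Y}\bar{X}}$. Under certain
conditions $\beta_{\bar{Y}\bar{X}} = \beta_{YX}$, and $B_{\bar{Y}\bar{X}}$ is an unbiased estimator of $\beta_{YX}$.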

Bias and Efficiency Formulas. Equations (14) and (19) express the expected
values of the regression slope from ungrouped and grouped data in
terms of the parameters of the modified causal structure. The
expectation of the difference between $B_{\bar{Y}\bar{X}}$ and $b_{YX}$ is the bias that
results from grouping, and it will be denoted by $\theta$. Then

(20)    $\theta = E(B_{\bar{Y}\bar{X}} - b_{YX}) = \gamma_{YZ}\gamma_{XZ}\left(\dfrac{\sigma_{\bar{Z}}^2}{\sigma_{\bar{X}}^2} - \dfrac{\sigma_Z^2}{\sigma_X^2}\right)$ .
The bias term $\theta$ has a straightforward interpretation. It implies
that the grouping of observations leads to biased estimation if all three
of the following conditions hold:

(a) The grouping variable Z has a direct relation to X ($\gamma_{XZ} \neq 0$).

(b) The grouping variable Z has a direct relation to Y ($\gamma_{YZ} \neq 0$).

(c) The ratio of the between-groups variances of Z and X does not
    equal the ratio of the total variances of Z and X.

Since Z has been defined in such a way that $Z_{ij} = \bar{Z}_{i.}$, $\sigma_{\bar{Z}}^2 = \sigma_Z^2$.
Whence,

    $\theta = \gamma_{YZ}\gamma_{XZ}\,\sigma_Z^2\left(\dfrac{1}{\sigma_{\bar{X}}^2} - \dfrac{1}{\sigma_X^2}\right) \propto \left[\dfrac{1}{SS_B(X)} - \dfrac{1}{SS_T(X)}\right]$ ,

where $SS_B(X)$ and $SS_T(X)$ are the between-groups and total sums of
squares of X.

Hence, condition (c) can be restated as

(c*) The between-groups sum of squares of X does not equal the
     total sum of squares of X.

The magnitude of the bias increases directly with the increasing
relation of Z to both X and Y·X and with the reduction in the variation
of X from grouping. These three conditions are not independent, and
some interesting ramifications of their interrelations are explored
elsewhere.
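
The three conditions can be read off a direct transcription of formula (20). The sketch below is illustrative only; the function and argument names are hypothetical, and the numerical inputs are made-up values chosen to show each condition switching the bias off.

```python
def grouping_bias(g_yz, g_xz, var_z_between, var_x_between, var_z_total, var_x_total):
    """Aggregation bias theta from formula (20)."""
    return g_yz * g_xz * (var_z_between / var_x_between - var_z_total / var_x_total)


# Condition (a) fails: Z has no direct relation to X.
print(grouping_bias(0.5, 0.0, 1.0, 0.625, 1.0, 1.0))
# Condition (b) fails: Z has no direct relation to Y.
print(grouping_bias(0.0, 0.5, 1.0, 0.625, 1.0, 1.0))
# Condition (c) fails: the between-groups and total variance ratios agree.
print(grouping_bias(0.5, 0.5, 0.5, 0.5, 1.0, 1.0))
# All three conditions hold: grouping shrinks var(X) but not var(Z).
print(grouping_bias(0.5, 0.5, 1.0, 0.625, 1.0, 1.0))
```

Only the last call returns a nonzero value, because only there is the variance of X reduced by grouping while both direct relations are present.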

A formula for the variance of the grouped estimator must be presented
before efficiency can be considered. By reasoning like that used to
find $V(b_{YX})$, the sample variance of $B_{\bar{Y}\bar{X}}$ can be shown to be

— w - 

The efficiency of the grouped estimator is given by

(22)    $\mathrm{Eff}(B/b) = \dfrac{MSE(b_{YX})}{MSE(B_{\bar{Y}\bar{X}})}$ ,

where MSE(T) denotes the mean square error for estimator T. Note that

    $MSE(T) = V(T) + (\mathrm{bias}(T))^2$ .

So when both estimators are unbiased, the efficiency of the grouped
estimator is the ratio of (16) to (21) which reduces to

A Taxonomy for Grouping Variables. Figure 1 presents the path diagrams
which can result from setting various combinations of $\gamma_{YZ}$ and $\gamma_{XZ}$ in
(10a-b) equal to zero. Each diagram represents a given set of constraints
on the relations of Z to Y and X.

A taxonomy for comparing grouping variables can also be derived
which will be parallel to the diagrams in Figure 1. Four categories of
grouping variables can be distinguished:

  I.   Z is directly related to both X and Y. ($\gamma_{XZ} \neq 0$, $\gamma_{YZ} \neq 0$)
  II.  Z is directly related to Y but not to X. ($\gamma_{XZ} = 0$, $\gamma_{YZ} \neq 0$)
  III. Z is directly related to X but not to Y. ($\gamma_{XZ} \neq 0$, $\gamma_{YZ} = 0$)
  IV.  Z is not related to either X or Y. ($\gamma_{XZ} = 0$, $\gamma_{YZ} = 0$)

These categories include all possible relations linking causally prior
grouping variables to the regression of Y on X. Certain of these
categories represent broader classes of variables. For instance, any
random grouping method will satisfy the conditions for Category IV. Most
systematic grouping variables belong to Category I. Any grouping
variable can be uniquely categorized if the variances and covariances of
X, Y, and Z are known.

Examination of Bias and Efficiency for Each Category. Equations (20) and
(22) can now be used to examine each category of grouping variables for
bias and efficiency. The taxonomic categories are considered in order.

1. Category I -- $\gamma_{YZ} \neq 0$, $\gamma_{XZ} \neq 0$

Category I includes all grouping variables which have direct relations
to both X and Y. An obvious example is that scholastic aptitude
(Z) may be directly related to both achievement (Y) and student academic
interests (X).

In general, the slope estimated from data grouped on a Category I
variable is a biased estimator of $\beta_{YX}$. The magnitude of this bias is
given exactly by (20) for known values of $\gamma_{YZ}$, $\gamma_{XZ}$, $\sigma_Z$, $\sigma_X$, and $\sigma_{\bar{X}}$.
Thus, - . — 

21 5c - 



Some idea about the bias for Category I variables becomes evident
from an examination of the bias in estimating standardized regression
coefficients. Assume that the original observations are standardized
and also that g groups of equal size are formed on discrete
values of Z so that $n_i = n$. Under these conditions,

(1)  $\sigma_Z^2 = \sigma_X^2 = 1$ ,

(2)  $\sigma_v^2 = \sigma_X^2 - \gamma_{XZ}^2\sigma_Z^2 = 1 - \gamma_{XZ}^2$ ,

(3)  $\sigma_{\bar{v}}^2 = \sigma_v^2 / n$ ,

and

     $\sigma_{\bar{X}}^2 = \dfrac{(n-1)\gamma_{XZ}^2 + 1}{n}$ .

After substitution and simplification, (20) can be written

(23)    $\theta^* = \gamma_{YZ}\gamma_{XZ}\left[\dfrac{(n-1)(1-\gamma_{XZ}^2)}{(n-1)\gamma_{XZ}^2 + 1}\right]$ ,

where $\theta^*$ denotes the bias of the grouped estimator of the standardized
regression coefficient.
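
The standardized bias formula is easy to tabulate. The sketch below evaluates it over the group sizes and coefficient values used in Table 2; the function name `theta_star` and the printing layout are illustrative choices, not the author's procedure.

```python
def theta_star(n, g_yz, g_xz):
    """Bias of the grouped standardized slope for equal groups of size n."""
    numerator = (n - 1) * (1.0 - g_xz ** 2)
    denominator = (n - 1) * g_xz ** 2 + 1.0
    return g_yz * g_xz * numerator / denominator


GROUP_SIZES = [2, 4, 5, 11, 20, 50, 100, 500]
COEFFICIENTS = [0.2, 0.5, 0.8]

for n in GROUP_SIZES:
    # One row per group size: gamma_YZ varies slowest, gamma_XZ fastest.
    row = [theta_star(n, g_yz, g_xz)
           for g_yz in COEFFICIENTS
           for g_xz in COEFFICIENTS]
    print(f"{n:4d}  " + "  ".join(f"{value:5.3f}" for value in row))
```

Setting either coefficient to zero, or the relation of Z to X to one, returns a bias of zero, which matches the qualifications attached to the conclusions drawn from the table.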

The asymmetrical properties of $\theta^*$ over the range of $\gamma_{XZ}$ are
illustrated by Figure 2. Predicted bias $\theta^*$ is plotted versus $\gamma_{XZ}$ for fixed
$\gamma_{YZ}$ (= 0.1) and selected values of n. A comparable family of curves can
be generated for any value of $\gamma_{YZ}$. The curves become highly skewed
(right) for large n and are roughly symmetrical for small n. This
occurs because the groupings become coarser and less representative of
the ungrouped observations as n gets larger, no matter what relations
exist between Z and X and Y.

Figure 2 



Table 2 indicates the bias $\theta^*$ for several values of $\gamma_{YZ}$, $\gamma_{XZ}$, and
n. An examination of the tabled values leads to the following conclusions:

1.) Bias increases with n (unless $\gamma_{YZ}$ is 0 or $\gamma_{XZ}$ is 0 or 1).

2.) For fixed $\gamma_{XZ}$ (not 0 or 1) and n, bias increases with $\gamma_{YZ}$.

3.) For fixed $\gamma_{YZ}$ (not 0 or 1) and n, bias first increases and
    then decreases with $\gamma_{XZ}$.

Table 2



Minimizing the direct relation to Y and maximizing the direct
relation to X is the safest way to achieve small bias. $\theta^*$ approaches its
maximum rapidly even for small values of n. Large n is less damaging
when $\gamma_{XZ}$ is large and $\gamma_{YZ}$ is small, though the necessary value of $\gamma_{XZ}$
increases rapidly with $\gamma_{YZ}$. For n = 500 and $\gamma_{YZ}$ = 0.1, $\gamma_{XZ}$ must be greater
than 0.60 to have bias less than 0.1. For n = 500 and $\gamma_{YZ}$ = 0.2,
$\gamma_{XZ}$ must be greater than 0.78 to achieve the same results.

Figure 2. Aggregation bias $\theta^*$ as a function of $\gamma_{XZ}$ and
          group size n ($\gamma_{YZ}$ = 0.10). Horizontal axis: $\gamma_{XZ}$, relation of Z to X.

Table 2. Predicted Bias in Estimating Standardized Regression
         Coefficient $\beta^*_{YX}$ from Grouped Data as a Function of
         Group Size n, $\gamma_{YZ}$, and $\gamma_{XZ}$.

         $\theta^*$ = Magnitude of the Bias [$E(B^*_{\bar{Y}\bar{X}}) - \beta^*_{YX}$]

Group       $\gamma_{YZ}$ = 0.2            $\gamma_{YZ}$ = 0.5            $\gamma_{YZ}$ = 0.8
size    $\gamma_{XZ}$:                 $\gamma_{XZ}$:                 $\gamma_{XZ}$:
 n      0.2    0.5    0.8      0.2    0.5    0.8      0.2    0.5    0.8

   2   0.037  0.060  0.035    0.092  0.150  0.088    0.148  0.240  0.140
   4   0.103  0.129  0.059    0.257  0.321  0.148    0.411  0.615  0.237
   5   0.132  0.150  0.065    0.331  0.375  0.162    0.529  0.600  0.259
  11   0.274  0.214  0.078    0.686  0.536  0.195    1.110  0.857  0.311
  20   0.415  0.248  0.083    1.036  0.620  0.208    1.658  0.991  0.333
  50   0.636  0.277  0.087    1.589  0.693  0.218    2.543  1.109  0.349
 100   0.766  0.288  0.089    1.916  0.721  0.222    3.066  1.153  0.354
 500   0.914  0.298  0.090    2.285  0.735  0.223    3.657  1.190  0.359
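
Thresholds of the kind quoted for n = 500 can be located by bisection on the standardized bias formula. The sketch below is an illustration under one stated assumption: the search is confined to an interval of $\gamma_{XZ}$ on which the bias is already falling, so that a single crossing exists. The function names and the default interval are hypothetical.

```python
def theta_star(n, g_yz, g_xz):
    """Bias of the grouped standardized slope for equal groups of size n."""
    numerator = (n - 1) * (1.0 - g_xz ** 2)
    denominator = (n - 1) * g_xz ** 2 + 1.0
    return g_yz * g_xz * numerator / denominator


def min_gxz_for_bias(n, g_yz, max_bias, lo=0.3, hi=1.0, tol=1e-6):
    """Smallest gamma_XZ in [lo, hi] at which the bias falls to max_bias.

    Assumes the bias exceeds max_bias at lo and decreases steadily
    between lo and hi.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if theta_star(n, g_yz, mid) > max_bias:
            lo = mid      # still too biased: move right
        else:
            hi = mid      # bias acceptable: tighten from the right
    return hi


threshold = min_gxz_for_bias(500, 0.2, 0.1)
print(round(threshold, 2))
```

For n = 500 and a direct relation to Y of 0.2, the crossing lands at about 0.78, in line with the figure given in the text.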

The expected bias can exceed 1 with large n and $\gamma_{YZ} > \gamma_{XZ}$. This should
be a further warning against choosing a grouping variable strongly related
to Y·X and against having a large number of observations per group. On
the other hand, the relatively small bias associated with small $\gamma_{YZ}$
offers some hope for reasonable estimates from data grouped by a Category
I variable.

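For Category I variables grouping bias affects efficiency in addition
to the variance of $B_{\bar{Y}\bar{X}}$. The mean squared error for $B_{\bar{Y}\bar{X}}$ for
Category I is

    $MSE(B_{\bar{Y}\bar{X}}) = V(B_{\bar{Y}\bar{X}}) + \theta^2(Z_I)$ .

So,

    $\mathrm{Eff}(B/b) = \dfrac{MSE(b_{YX})}{MSE(B_{\bar{Y}\bar{X}})} = \dfrac{V(b_{YX})}{\theta^2 + V(B_{\bar{Y}\bar{X}})} \le \dfrac{V(b_{YX})}{V(B_{\bar{Y}\bar{X}})}$ .

That is, the correlation ratio is an upper bound for the efficiency of
Category I grouping.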
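
The effect of bias on efficiency can be made concrete with a small helper. This is a sketch only; the function names are hypothetical and the variances and bias passed in are made-up numbers, not values from the study.

```python
def mse(variance, bias=0.0):
    """Mean squared error: variance plus squared bias."""
    return variance + bias ** 2


def efficiency(var_ungrouped, var_grouped, bias_grouped=0.0):
    """Efficiency of the grouped estimator relative to the ungrouped one.

    The ungrouped estimator is taken to be unbiased, so its MSE is
    just its variance.
    """
    return mse(var_ungrouped) / mse(var_grouped, bias_grouped)


unbiased_case = efficiency(0.001, 0.004)           # no grouping bias
biased_case = efficiency(0.001, 0.004, bias_grouped=0.1)
print(unbiased_case, biased_case)
```

Whatever the inputs, adding bias can only lower the ratio, which is why the unbiased-case efficiency acts as a ceiling for Category I grouping.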

Though Category I grouping is generally less efficient than Category
III grouping, it can be more efficient than Category II or Category
IV grouping. When $\gamma_{XZ}$ is large relative to $\gamma_{YZ}$, Category I variables
behave like Category III variables. Though the resulting regression
estimator from Category I grouping includes a small amount of bias, it will
most likely be more stable (smaller mean square error) than an unbiased
estimate from either Category II or Category IV grouping under similar
conditions (number of groups and distribution of observations among groups).

2. Category II -- $\gamma_{YZ} \neq 0$, $\gamma_{XZ} = 0$

Category II contains grouping variables $Z_{II}$ which are related to
Y ($\gamma_{YZ} \neq 0$) and are not related to X ($\gamma_{XZ} = 0$). Since $\gamma_{XZ} = 0$,

    $\theta(Z_{II}) = E(B_{\bar{Y}\bar{X}}) - E(b_{YX})$
              $= \gamma_{YX} - \gamma_{YX}$
              $= 0$ .

Thus estimates derived from data grouped by a Category II variable are
unbiased.

The conclusions for Category II grouping are not surprising.
When Z is a Category II variable, Equation (10a) is the original
model (Equation (1)) where the "other" causes represented by u have been
divided into two parts (Z and w), both independent of X. Thus unbiased
estimates are expected under the OLS assumptions.

Category II variables are hard to find. None of the more than 200
pairs of parameter estimates, $\hat{\gamma}_{YZ}$ and $\hat{\gamma}_{XZ}$, from the empirical data
described in Section 5 satisfactorily meet the conditions for Category II
grouping. No doubt such variables can be constructed by some
orthogonalization procedure, but there are other categories of variables which
yield unbiased estimators with greater efficiency. Henceforth,
Category II will receive little attention.

3. Category III -- $\gamma_{YZ} = 0$, $\gamma_{XZ} \neq 0$

Category III consists of all variables $Z_{III}$ which are related to
Y only through X. Systematic grouping on the independent variable falls
in Category III. $Z_{III}$ may be an explicit ordered function of X such
as the decile rank of X. With this kind of grouping, the within-group
distributions of X do not overlap. It is also possible that Z involves
some random component (v) which allows the within-group distributions
of X to overlap. The presence or absence of overlap is irrelevant in the
determination of bias but can affect efficiency.

Since $\gamma_{YZ} = 0$, equations (10a) and (17a) reduce to

    $Y = \alpha + \gamma_{YX} X + w$

and

    $\bar{Y} = \alpha + \gamma_{YX}\bar{X} + \bar{w}$ .

These equations are like (1) though the regression parameters and
disturbance term have been relabeled. Thus for Category III grouping, the
simple model and the modified model with the grouping variable
incorporated are the same and estimate the same $\beta_{YX}$.

Thus, the least-squares estimators from data grouped on a variable Z
which is related to X but not to Y·X are unbiased for any value of $\beta_{YX}$,
which can be expressed in equation form by saying that

    $\theta(Z_{III}) = 0$ .
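
A small simulation illustrates the Category III result for grouping on the decile rank of X. It is a sketch under assumed conditions (a true slope of 0.5, standard-normal X and disturbance, a fixed seed), and the helper names are hypothetical.

```python
import random
from collections import defaultdict


def grouped_slope(xs, ys, labels):
    """Size-weighted slope of group-mean Y on group-mean X."""
    n_total = len(xs)
    grand_x = sum(xs) / n_total
    grand_y = sum(ys) / n_total
    sums = defaultdict(lambda: [0, 0.0, 0.0])
    for x, y, label in zip(xs, ys, labels):
        entry = sums[label]
        entry[0] += 1
        entry[1] += x
        entry[2] += y
    num = den = 0.0
    for n_i, sum_x, sum_y in sums.values():
        dx = sum_x / n_i - grand_x
        dy = sum_y / n_i - grand_y
        num += n_i * dx * dy
        den += n_i * dx * dx
    return num / den


rng = random.Random(1)
N, TRUE_SLOPE = 20000, 0.5
xs = [rng.gauss(0.0, 1.0) for _ in range(N)]
ys = [TRUE_SLOPE * x + rng.gauss(0.0, 1.0) for x in xs]

# Grouping variable: decile rank of X (0 for the lowest tenth, 9 for the highest).
order = sorted(range(N), key=lambda i: xs[i])
decile = [0] * N
for rank, index in enumerate(order):
    decile[index] = rank * 10 // N

slope_cat3 = grouped_slope(xs, ys, decile)
print(round(slope_cat3, 3))
```

Ten group means are enough here to recover the slope closely, because grouping on X keeps almost all of the variation in X between groups.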

The bias and efficiency of Category III grouping have been studied
extensively, dating back at least to Prais and Aitchison's work (1954).
Most variables systematically related to X do not strictly satisfy the
condition $\gamma_{YZ} = 0$ and thus exhibit some minimal bias. If this condition
is relaxed so that $\gamma_{YZ}$ is considered zero if it does not exceed three
times its standard error ($\hat{\gamma}_{YZ} < 3\,SE(\hat{\gamma}_{YZ})$), there are generally several
Category III variables in any large study. This is fortunate since
Category III estimates are always unbiased and can be highly efficient (Prais
and Aitchison, 1954). If such variables do exist in a study, the
remaining decision should focus on how to best utilize them to optimize
the precision of parameter estimation.

4. Category IV -- $\gamma_{YZ} = 0$, $\gamma_{XZ} = 0$

Category IV contains all variables $Z_{IV}$ which have no relation to
either X or Y. Student weight in ten-pound units for the study of the
achievement-on-aptitude regression is an example of a Category IV
variable. $Z_{IV}$ might also be a variable generated by assigning numbers
randomly to individual observations, such as a student ID. Category IV
grouping, alternatively called unsystematic or random grouping, results
in random groups of (X,Y) observations.

When $\gamma_{YZ} = 0$ and $\gamma_{XZ} = 0$, it follows that

    $E(B_{\bar{Y}\bar{X}}) = E(b_{YX}) = \gamma_{YX}$ .

Hence,

    $\theta(Z_{IV}) = 0$ .

So for any $Z_{IV}$, $B_{\bar{Y}\bar{X}}$ is an unbiased estimator of $\beta_{YX}$.

The interpretation of this result is straightforward. Estimating
$\beta_{YX}$ from the means of g randomly formed groups is statistically
equivalent to estimating $\beta_{YX}$ from a sample of size g drawn randomly from the
N observations or from the g strata means where the strata have been
randomly formed (Hansen et al., 1953). In either case, the random
process does not alter any preexisting relations among the variables. All
variances and covariances among the variables decrease in proportion to
the number of observations in a group for fixed group size n for
Category IV grouping. This proportional reduction in magnitude does not
alter the estimate of the regression coefficient.
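
The proportional shrinkage under random grouping can be seen in a short simulation. The sketch assumes a true slope of 0.5, standard-normal X and disturbance, groups of n = 20 formed by shuffling, and a fixed seed; all names are illustrative.

```python
import random


def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def ols_slope(xs, ys):
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(2)
N, GROUP_SIZE, TRUE_SLOPE = 20000, 20, 0.5
xs = [rng.gauss(0.0, 1.0) for _ in range(N)]
ys = [TRUE_SLOPE * x + rng.gauss(0.0, 1.0) for x in xs]

# Category IV grouping: assign observations to groups purely at random.
ids = list(range(N))
rng.shuffle(ids)
groups = [ids[k:k + GROUP_SIZE] for k in range(0, N, GROUP_SIZE)]
x_means = [sum(xs[i] for i in g) / GROUP_SIZE for g in groups]
y_means = [sum(ys[i] for i in g) / GROUP_SIZE for g in groups]

random_group_slope = ols_slope(x_means, y_means)
variance_ratio = variance(x_means) / variance(xs)   # expected near 1 / GROUP_SIZE
print(round(random_group_slope, 3), round(variance_ratio, 3))
```

The slope stays near its ungrouped value while the variance of X shrinks by roughly the group size, which is the source of the efficiency loss discussed next.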

Category IV variables may not be the best choices for grouping when 
efficient estimates are desired because of the difficulty of obtaining 
an adequate number of groups to overcome the marked efficiency reduction 
(Feige and Watts, 1973). Unfortunately, in many cases Category IV var- 
iables may be the only ones for which the investigator has sufficient un- 
derstanding to form groups. 

Using B ^^^ and e ^ to Predict Bias. One interesting finding is that the
investigator need not actually estimate $\gamma_{YZ}$ and $\gamma_{XZ}$ to determine the
possible bias from a given grouping method. If $\gamma_{XZ}$, $\rho_{YZ}$, and $\sigma_{\bar{X}}$
are determinable (this is the case when either X or Y has been collected
anonymously so that pairs of observations cannot be matched at the
individual level), an upper bound for aggregation bias can be estimated
under most prevailing conditions. This is accomplished by substituting
$\rho_{YZ}$ for $\gamma_{YZ}$ in formula (20) to get



^YZ 



(e). 



TYZ 

Thus the investigator need not be hampered by response anonymity in
choosing the "best" grouping variable for his study.

Estimating $\rho_{XY}$ from a Systematic Grouping Procedure. The results for
the bivariate case also suggest that estimates of $\rho_{XY}$ can be obtained
from a systematic grouping procedure. The standardized regression
coefficient $\beta^*_{YX}$ (the "*" designates standardized parameters and estimators) and
the zero-order correlation coefficient $\rho_{XY}$ are equal, so that the grouped
estimator $B^*_{\bar{Y}\bar{X}}$ is an estimate of both $\beta^*_{YX}$ and $\rho_{XY}$ (and thereby an estimate
of both the regression and correlation coefficients in the standardized
case). So "good" methods of estimating $\beta$ are also good methods for
estimating $\rho$ when the original observations have first been standardized.
This result should prove useful for persons interested only in correlation
coefficients.
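
The equality of the standardized slope and the zero-order correlation in the bivariate case can be confirmed on any small data set. The numbers below are made up for illustration, and the helper names are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Convert values to z-scores using the population standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def ols_slope(xs, ys):
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def correlation(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (pstdev(xs) * pstdev(ys))


x = [1.0, 2.0, 4.0, 7.0, 11.0]
y = [2.0, 1.0, 5.0, 6.0, 12.0]

standardized_slope = ols_slope(standardize(x), standardize(y))
r_xy = correlation(x, y)
print(standardized_slope, r_xy)
```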

The Taxonomy as a Guide for Investigation. The main implication from the
above discussion is that the investigator should consider the relations
of the alternative grouping variables to the study variables before
collecting his data, using such prior knowledge as is available. This will
enable him to collect information only on those grouping variables which
yield estimates having the desired properties.

If the investigator demands an unbiased estimate of $\beta_{YX}$ then, under
the assumptions of the model, variables from Categories II, III, and IV
are satisfactory. While Category IV variables can always be found or
created, they are relatively inefficient. Category III variables can be
highly efficient, yielding large values of $SS(\bar{X})$. The efficiency of
Category II grouping is no better than for Category IV grouping because
observations are assigned to groups essentially randomly with respect to
X. Category III variables are clearly the best choices for data
aggregation.

Category I variables yield biased estimates though the bias can be
small for large $\gamma_{XZ}$ and small $\gamma_{YZ}$. Category I estimators are less
efficient than those from Category II or Category IV grouping. If small bias
is tolerable and Category III variables are hard to find, Category I
grouping may be advisable.

Most of the discussion has assumed that an investigator has the
original observations and can choose his own grouping procedure. Data
can be available in aggregated form only, however; e.g., when individual
data have been aggregated for economy of storage or for confidentiality.
The grouping variables that generally appear under these circumstances
are geographic variables such as "state" and "census tract", and school
system delimiters such as "school", "teacher", and "classroom". These
grouping variables are generally related to X and Y·X and are thereby
subject to the criticisms of Category I grouping. Regression estimates
determined under these conditions should be interpreted cautiously.

4. An Empirical Example -- The Regression of Academic Self-Rating on
   Achievement

The literature on aggregation offers very little in the way of
concrete demonstrations of the likely magnitude of aggregation bias in
realistic cases. This sort of work is quite important in informing the
substantive researcher as to the likely consequences of alternative
grouping strategies. Information collected on incoming freshmen at a
large midwestern university will serve as the data base for an
empirical demonstration of the grouping methods described in the taxonomic
approach to grouping.

Over 300 measures of the abilities, attitudes, and interests of
the students were collected in the original study. Approximately 20
of these measures are used in the present analysis. To avoid
confounding missing-data problems and aggregation problems, the analyses are
performed only on the 2676 students with complete information on all
measures.

Regression Model from Ungrouped Data. The parameter of interest is the
regression coefficient from the regression of a composite self-rating
of academic abilities (SRAA) on achievement test performance (ACH). The
main reason for the proposed order is a concern for clarity of
illustration, as the chosen causal ordering appears to be more informative
with regard to the effects of grouping than the reverse order.

The two primary variables were chosen so that the example reflects
the kind of study where anonymity of response might be a problem. None
of the empirical data was collected anonymously, so that the results
from treating the data as completely identified can be compared to the
results when anonymous collection of some information is assumed.

Using the 2676 observations at the individual level, the equation
relating achievement to academic self-rating is

    SRAA = -29.136 + .344(ACH)

so that $b_{YX}$ = .344. Also,

    $SE(b_{YX})$ = 0.011 ,
    $r_{XY}$ = 0.529 ,

and $R^2_{XY}$ = 0.281.

Identification of Grouping Variables. The grouping variables for the
example are described in Table 3. They were selected from available
information because (1) previous studies of the relation between
achievement and academic self-rating included similar indicators (e.g.,
parental income (PARINC), father's education (FATHED)) or (2) the frequency
distribution of the variable and its correlations with ACH and SRAA
(see Table 4) suggested that it might represent a particular taxonomic
category.

Table 3

The seventeen grouping variables are mostly student reports of
parental characteristics and of their own background and attitudes as
measured by single-item scales. These single-item indicators generally have
low reliability but are easily manageable for grouping because of their
limited number of response choices. Some, however, turn out to be
surprisingly good grouping variables.

Table 3. Variable identifications, descriptions, and the number of
         groups formed after data aggregation.

Variable                                                        Number of Groups
Identification  Variable Description                            After Aggregation

ID2             Last 2 digits of student identification               100
ID1             Last digit of student identification                   10
HSGPA2          High school's report of student's grade point          23
                average on a 4-point scale (highest 2 digits)
SAT2            Highest 2 digits of Total Score from the               13
                Scholastic Aptitude Test
ACH2            Highest 2 digits of Total Score from the               10
                Achievement Battery
PARINC          Student's best estimate of 1970 parental               10
                income before taxes
REPGPA          Student's report of average grade in                    7
                secondary school
FATHED          Student's report of highest level of formal             6
                education obtained by his father
SRAA2           Highest digit and sign of composite academic            5
                self-opinion
ANTHIDEG        Student's anticipated highest academic degree           5
HSMATH          Student's report of number of semesters of              5
                high school mathematics
HSPHYS          Student's report of number of semesters of              5
                high school physical sciences
NOBOOK          Student's report of number of books in the home         5
PARASP          "What is the highest level of education that            5
                your parents hope you will complete?"
CLIMP           "My grades are markedly better in courses that          4
                I see I will need later."
COLEFF          "I often wonder if four years of college will           4
                really be worth the effort."
QCJOB           "I often wish that I were offered a good job            4
                now so I wouldn't have to spend four years in
                college."

Table 4 contains the means, standard deviations, and skewness
coefficients for each study variable. Particularly interesting is the
behavior of the four- and five-choice scales. Though all seven are similar
with respect to the magnitude of their standard deviations, four have
highly negatively skewed distributions (HSMATH, QCJOB, PARASP, and NOBOOK).
In general the large skewness values result from an uneven distribution
of observations among groups and a high degree of consolidation at one
end of the scale or the other. This sort of distribution is not
conducive to precise estimation. So it might be expected that estimates from
data grouped by these variables would be less precise than the estimates
from data grouped by variables with a more even spread of observations
among groups and a more symmetric distribution.

Table 4

Another factor to consider in examining the grouping variables is 
the relative coarseness of the grouping as represented by the number of 
groups formed. In general characteristics resulting in the formation of 
more groups yield more precise estimates. In fact, the relative effi- 
ciency of grouping on two variables with approximately the same rela- 
tions to SRAA and ACH, is largely determined by the differences in the 
number of groups formed by each (of course this result is tempered by 
uneven distribution of observations among the groups). 
Categorizing the Grouping Characteristics. The process whereby
grouping characteristics are classified into taxonomic categories requires
more information about the grouping variables than is provided in Table
4. Table 5 contains estimates of the unstandardized regression
coefficients, $\hat{\gamma}_{YZ}$ and $\hat{\gamma}_{XZ}$, their standardized counterparts, $\hat{\gamma}^*_{YZ}$ and $\hat{\gamma}^*_{XZ}$, the
zero-order correlation of Y and Z, $\rho_{YZ}$, and the between-groups standard
deviation of the independent variable ACH, $\sigma_{\bar{X}}$, for each of the grouping
characteristics.

Table 4. Means, standard deviations, and skewness coefficients of
         study variables.

                               Standard
Variable        Mean           Deviation       Skewness

SRAA            0.008          [illegible]     [illegible]
ACH            84.766          [illegible]     [illegible]
ID2            49.561          [illegible]     [illegible]
ID1             4.453          [illegible]     [illegible]
HSGPA2          3.157          [illegible]     [illegible]
SAT2           10.235          [illegible]     [illegible]
ACH2            8.024          [illegible]     [illegible]
PARINC         [illegible]     [illegible]     [illegible]
REPGPA         [illegible]     [illegible]     [illegible]
FATHED         [illegible]     [illegible]     [illegible]
SRAA2          [illegible]     [illegible]      0.399
ANTHIDEG       [illegible]     [illegible]     [illegible]
HSMATH          4.332          [illegible]     [illegible]
HSPHYS          2.623           0.977           0.319
NOBOOK          4.104           0.978          -0.769
PARASP          4.458           0.626          -1.523
CLIMP           2.201           0.821           0.304
COLEFF          2.695           0.951          -0.209
QCJOB           3.330           0.821          -1.151



Table 5 



Applying the criterion that a parameter must exceed three times its
standard error to be considered nonzero leads to the following category
assignments for the grouping characteristics:

                                 $\hat{\gamma}_{YZ} \ge 3\,SE(\hat{\gamma}_{YZ})$            $\hat{\gamma}_{YZ} < 3\,SE(\hat{\gamma}_{YZ})$

                                 Category I                   Category III

$\hat{\gamma}_{XZ} \ge 3\,SE(\hat{\gamma}_{XZ})$     HSGPA2      HSMATH           ACH2
                                 SAT2        NOBOOK           PARINC
                                 ANTHIDEG    PARASP           HSPHYS
                                 REPGPA      COLEFF           CLIMP
                                 FATHED      QCJOB
                                 SRAA2

                                 Category II                  Category IV

$\hat{\gamma}_{XZ} < 3\,SE(\hat{\gamma}_{XZ})$       (NONE)                       ID2
                                                              ID1
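
The three-standard-error rule amounts to a two-way classification, which the sketch below spells out. The function name and the example inputs are hypothetical; comparing absolute values is an assumption made here so that negative estimates are handled, and the standard errors shown are made up because the source does not tabulate them.

```python
def classify_grouping_variable(g_yz, se_yz, g_xz, se_xz, k=3.0):
    """Assign a taxonomic category from two estimates and their standard errors.

    A coefficient counts as nonzero when its absolute value is at least
    k times its standard error.
    """
    related_to_y = abs(g_yz) >= k * se_yz   # direct relation of Z to Y
    related_to_x = abs(g_xz) >= k * se_xz   # direct relation of Z to X
    if related_to_y and related_to_x:
        return "I"
    if related_to_y:
        return "II"
    if related_to_x:
        return "III"
    return "IV"


# Hypothetical estimates and standard errors, one per category.
print(classify_grouping_variable(2.0, 0.3, 8.0, 0.5))    # both nonzero
print(classify_grouping_variable(2.0, 0.3, 0.2, 0.5))    # only Y relation
print(classify_grouping_variable(0.4, 0.3, 8.0, 0.5))    # only X relation
print(classify_grouping_variable(0.01, 0.3, -0.2, 0.5))  # neither
```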

As mentioned previously, no characteristics belong to Category II,
and the number falling in Category I is large. SRAA2 and ACH2 are special
cases within Categories I and III, respectively. They are the best
approximations to what Blalock (1964) and Hannan (1971) have called "grouping
on the dependent variable" and "grouping on the independent variable."

Table 5. Estimates of the unstandardized regression coefficients
         ($\hat{\gamma}_{YZ}$ and $\hat{\gamma}_{XZ}$), the standardized regression coefficients
         ($\hat{\gamma}^*_{YZ}$ and $\hat{\gamma}^*_{XZ}$), the zero-order correlation of Y and Z ($\rho_{YZ}$),
         and the between-groups standard deviation of ACH ($\sigma_{\bar{X}}$) for
         each grouping variable from the regression of SRAA on ACH.

Variable                          Parameter Estimates
Name        $\hat{\gamma}_{YZ}$        $\hat{\gamma}_{XZ}$     $\hat{\gamma}^*_{YZ}$   $\hat{\gamma}^*_{XZ}$   $\rho_{YZ}$       $\sigma_{\bar{X}}$

ID2         [illegible]     .011     .008     .020     .019       2.918
ID1         [illegible]    -.225    -.011    -.042    -.033       1.208
HSGPA2      [illegible]   17.636     .123     .535     .370       8.525
SAT2        [illegible]    7.114     .406     .827     .566      12.833
ACH2        [illegible]    9.670     .070     .983     .522      15.203
PARINC        .191          .470     .028     .070     .064       1.891
REPGPA      [illegible]   -5.900    -.258    -.490    -.455       7.874
FATHED      [illegible]    1.512     .073     .139   [illegible]  2.321
SRAA2       [illegible]   10.690     .819     .476     .885       7.427
ANTHIDEG     1.956         2.512     .186     .156     .264       2.455
HSMATH       -.757         8.429    -.066     .479     .202       7.556
HSPHYS        .469         5.033     .046     .318     .209       5.635
NOBOOK       1.252         2.312     .122     .146     .196       2.281
PARASP       2.212         1.628     .138     .066     .186       1.192
CLIMP         .043         2.767     .003     .147     .074       2.508
COLEFF       1.277         2.173     .121     .134     .189       2.223
QCJOB        1.775         1.986     .145     .105     .199       1.770

Estimates of Regression Coefficients from Different Grouping Methods.
Table 6 contains estimates of the unstandardized regression
coefficients, their standard errors, the observed and predicted grouping bias
(with $\hat{\theta}$ and $\hat{\pi}$), and estimates of the mean squared error for each
grouping method. The grouping variables have been organized by category (in
the order IV, III, and I) and by the size of the observed bias within
category. ACH2 and SRAA2 have been assigned to special subcategories
III' and I' in recognition of their unique relation to the main variables.

Table 6

In general the estimates conform to expectations though the bias and
mean squared error (MSE) are enormous for some Category I groupings.
Category IV grouping yielded estimates with relatively small bias. In
fact, the precision (small bias and mean squared error) of the estimate
from ID2 is exceeded only by grouping on the independent variable (by
ACH2). But it took ten times as many groups to achieve this level of
accuracy.

The magnitude of the bias from grouping by ID1 (10 groups) and its
MSE represent, respectively, a ten- and seven-fold increase over the
corresponding values from ID2 grouping (100 groups). ID1 performs less
well than certain variables from other categories, especially for those
variables forming as many groups. The Category III variables that form
10 groups (ACH2 and PARINC) yield smaller bias and smaller MSE than
ID1. The two Category I variables that form more than 10 groups (SAT2
and HSGPA2) result in larger bias but smaller MSE. Finally, HSPHYS
yields smaller bias than ID1, and both HSPHYS and HSMATH yield smaller
MSE though they form only half as many groups. Clearly, random grouping
should be avoided unless many groups can be formed and there are no
other variables with systematic relations to the main variables that also
yield a larger number of groups.

Table 6. Grouping bias in unstandardized regression coefficients
         from the regression of SRAA on ACH.
         Ungrouped Estimates: $b_{YX}$ = .344, $SE(b_{YX})$ = .0108, $MSE(b_{YX})$ = .0001

                                            Bias
Grouping       Observed               Observed   Predicted   Predicted
Variables      $B_{\bar{Y}\bar{X}}$     $SE(B_{\bar{Y}\bar{X}})$     $\theta$        $\hat{\theta}$          $\hat{\pi}$         MSE

Category IV
  ID2            .339      .0496       -.005      .003        .007      .0025
  ID1            .286      .1184       -.058      .047        .168      .0175

Category III'
  ACH2           .344      .0393        .000      .002        .007      .0016

Category III
  PARINC         .362      .0850        .018      .082        .192      .0075
  HSPHYS         .370      .0592        .026      .062        .282      .0041
  CLIMP          .466      .2568        .121     -.012        .263      .0840

Category I'
  SRAA2         1.200      .0406        .856      .846        .912      .7310

Category I
  HSMATH         .269      .0592       -.076     -.066        .203      .0062
  SAT2           .435      .0434        .091      .099        .137      .0100
  HSGPA2         .454      .0186        .110      .098        .294      .0122
  FATHED         .589      .1057        .245      .285        .567      .0707
  REPGPA         .593      .0401        .249      .235        .413      .0631
  NOBOOK         .858      .0765        .514      .520        .838      .2690
  COLEFF         .944      .0958        .600      .497        .778      .3680
  ANTHIDEG      1.058      .1741        .714      .730       1.088      .5401
  QCJOB         1.142      .2695        .798      .748       1.025      .7078
  PARASP        1.260      .4758        .926      .987       1.240     1.0637

Observed Bias = $B_{\bar{Y}\bar{X}} - b_{YX}$

Three of the four Category III variables yield highly satisfactory
estimates of $\beta_{YX}$ with small MSEs. The exception is CLIMP, whose
estimate has only the ninth smallest bias and the seventh smallest MSE.
The fact that all of CLIMP's relation to SRAA operates through ACH plus
the small number of groups formed (4) might account for the ambiguous
results with this grouping variable.

Three Category I variables, HSMATH, SAT2, and HSGPA2, yield relatively
precise (within .11) estimates of $\beta_{YX}$. All have substantial zero-order
correlations with ACH, zero-order correlations with SRAA that are clearly
smaller, and result in large between-groups standard deviations of ACH.
Their MSEs are also respectably small. An investigator who uses a
grouping variable of their calibre will not reach conclusions that differ in any
drastic manner from the investigator who works with the individual-level
observations.

The remaining Category I variables, including SRAA2, yield particularly
poor estimates of the relation of ACH to SRAA. The grouped estimate ranges
from .59 (FATHED) to an astonishingly large 1.26 (PARASP) for these
variables, almost four times the ungrouped estimate ($b_{YX}$ = .344)!
Their estimates are also unstable, with mean squared errors ranging from
.06 to 1.06. The MSE for PARASP is 10,000 times larger than the MSE for
the estimate from ungrouped data.

The example again demonstrates that grouping on the dependent variable
is disastrous in terms of bias. The unmeasured factors represented by the
disturbance term in the initial linear model (Eq. 1) are confounded with
the effects of the primary regressor to such a degree that the relation
of ACH to SRAA is unrecognizable.

Overall, there are clear distinctions between the performance of
Category I variables and the other grouping variables. In every case,
the standard error of the regression estimate from a Category I variable
is smaller than the observed bias, so the estimate misses $b_{YX}$ by more
than one standard error; all regression estimates from Category III and IV
grouping fall within one standard error of $b_{YX}$. So one gains some
knowledge of the accuracy of an estimate by simply categorizing grouping
characteristics and then examining standard errors.
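
The comparison of standard errors with observed bias can be checked directly against the tabled values. The following is a minimal sketch, not part of the original analysis: the standard errors and observed biases are transcribed from Table 6, while the dictionary layout and the helper name `within_one_se` are assumptions introduced here for illustration.

```python
# (category, SE of grouped estimate, observed bias), transcribed from Table 6.
# The data structure and helper below are illustrative assumptions.
TABLE6 = {
    "ID2":      ("IV",  0.0496, -0.005),
    "ID1":      ("IV",  0.1184, -0.058),
    "PARINC":   ("III", 0.0850,  0.018),
    "HSPHYS":   ("III", 0.0592,  0.026),
    "CLIMP":    ("III", 0.2568,  0.121),
    "HSMATH":   ("I",   0.0592, -0.076),
    "SAT2":     ("I",   0.0434,  0.091),
    "HSGPA2":   ("I",   0.0186,  0.110),
    "FATHED":   ("I",   0.1057,  0.245),
    "REPGPA":   ("I",   0.0401,  0.249),
    "NOBOOK":   ("I",   0.0765,  0.514),
    "COLEFF":   ("I",   0.0958,  0.600),
    "ANTHIDEG": ("I",   0.1741,  0.714),
    "QCJOB":    ("I",   0.2695,  0.798),
    "PARASP":   ("I",   0.4758,  0.926),
}


def within_one_se(se, bias):
    """True when the grouped estimate lies within one SE of the ungrouped one."""
    return abs(bias) <= se


flags = {name: within_one_se(se, bias) for name, (_, se, bias) in TABLE6.items()}

# Category III and IV estimates all fall within one standard error ...
cat_iii_iv_within = all(
    flags[name] for name, (cat, _, _) in TABLE6.items() if cat in ("III", "IV")
)
# ... while every Category I estimate misses by more than one standard error.
cat_i_outside = all(
    not flags[name] for name, (cat, _, _) in TABLE6.items() if cat == "I"
)

for name, (cat, se, bias) in TABLE6.items():
    print(f"{name:9s} Category {cat:3s} SE={se:.4f} bias={bias:+.3f} within={flags[name]}")
```

In practice the observed bias is unknown when only grouped data are available, so the check is useful here mainly as a confirmation that the category of the grouping variable, rather than the standard error alone, signals trouble.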

There appear to be other warning signals for poor estimation from
Category I variables, even when confidentiality considerations prevent
direct estimation of $\beta_{YX}$. Seven of the eight Category I variables
that yielded large bias also had zero-order correlations with SRAA that
exceeded their zero-order correlations with ACH (i.e., $\rho_{YZ} >
\gamma_{XZ}$ ($= \rho_{XZ}$)); for REPGPA, $\rho_{YZ}$ and $\gamma_{XZ}$
have approximately the same magnitude (-.455 and -.490, respectively).
For SRAA2, ANTHIDEG, QCJOB, and PARASP, $\gamma_{YZ}$ is larger than
$\gamma_{XZ}$, though this is a comparison of a partial correlation with
a zero-order coefficient.

There is an additional observable warning for the single-item scales in
Category I. Grouping by six of the eight Category I variables of this
type (excepting HSMATH and REPGPA) results in small between-groups
standard deviations in ACH. ANTHIDEG yields the largest of these (2.46)
and PARASP yields the smallest (1.19).

PARASP is an extremely poor grouping characteristic. Of the five
choices for the PARASP question, 2612 (97%) responded that their parents
hoped that they would complete college (response 4) or obtain a graduate
or professional degree (response 5). Thus, PARASP really distinguishes
between only two parental aspiration choices and operates as if there
were only two groups. It is not surprising, then, that PARASP yields such
a poor estimate of $\beta_{YX}$. Grouping by PARASP is the negation of the
principles for precise estimation that were discussed in earlier
sections. The grouping would be coarse even if the observations were
evenly distributed among groups; its relations to ACH and SRAA are the
reverse of good practice; it barely maintains between-groups variation
in ACH, much less maximizes it; and the distribution of observations
among the groups is so uneven that two groups rather than five would
have been sufficient.

There are other Category I variables that are little better.
Essentially the same statements can be made about grouping by QCJOB as
were made for grouping by PARASP. Again, there are few initial groups,
$\gamma_{YZ} > \gamma_{XZ}$, the between-groups standard deviation of ACH
is small (1.77), and the observations are unevenly distributed (86%, or
2272 out of 2637, in the two highest categories). ANTHIDEG suffers from
similar shortcomings, with fewer than 100 observations for its two lowest
groups.

a. Predicted Bias vs. Observed Bias. 

Despite the high likelihood of specification and measurement errors
in the simple model examined, bias predictions stand up well in most
cases. For every grouping where the observed bias is greater than .2,
the predicted $\theta$ is also greater than .2. For every grouping
variable yielding observed bias less than .1, the predicted bias is also
less than .1.

The predicted value of $\theta$ can be considered misleading for only
two variables, ID1 and CLIMP. In the case of ID1, it is the matter of
sign reversal that troubles us and not the difference in magnitude
between predicted and observed bias. The predicted $\theta$ for CLIMP
would lead one to expect a more precise estimate than actually occurred.
The observed bias is not too distressing, however.

For the empirical data, the $\pi$ values are larger than the observed
bias in every case. If the grouping variables are ordered from smallest
to largest $\pi$ values, the variables with the lowest values (less than
.3) are the Category III and IV variables plus the 3 Category I variables
with the smallest observed bias (HSMATH, SAT2, HSGPA2).

b. Composite Estimates From Multiple Grouping Variables. 

The above findings suggest that an investigator can separate those
grouping characteristics which lead to reasonably accurate estimates
from those providing extremely misleading ones in empirical studies
similar to ours. Once this separation has been accomplished, the
investigator can choose the characteristic with the smallest predicted
bias. Better yet, he can use the available information about each
characteristic and its expected bias to form a weighted composite
estimate. The latter can be accomplished by weighting grouped estimates
in inverse proportion to their predicted bias. One can also take the
standard errors of the grouped estimates into account by giving
additional weight to the more stable estimates.

Two examples of the suggested compositing were carried out. In
Example (A), $\sigma^2$ was treated as known and the $\theta$ values are
used for weighting; in Example (B), $\sigma^2$ was treated as unknown and
thus the $\pi$ values are used for weighting. The five grouping
variables, excluding ACH2 and ID2, with the smallest predicted bias were
used in each example. Weights were determined by (i) predicted bias only
and (ii) the product of the predicted bias and the standard error of the
grouped estimate.

The results of the compositing process are very satisfactory. When
$\sigma^2$ is known, the composite grouped estimates of $\beta_{YX}$ are
(i) .355 and (ii) .345. When $\sigma^2$ is unknown, the composite grouped
estimates of $\beta_{YX}$ are (i) [illegible] and (ii) [illegible].
Composite estimate A(ii) performs about as well as grouping on the
independent variable ACH2. Composite estimate A(i) is closer to the
actual $\beta_{YX}$ than the estimates from any of the grouping variables
except ACH2 and ID2. Composite estimates B(ii) and B(i) do nearly as
well, equaled or exceeded only by the estimates from ACH2, ID2, and
PARINC. Clearly, judicious use of weighted compositing of grouped
estimates, in conjunction with the bias prediction, can lead the
conscientious investigator to precise estimates from grouped data.
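
The two weighting schemes can be sketched in a few lines. This is a hedged illustration, not the computation used in the paper: the text does not give the exact weighting formula, so the sketch assumes weights proportional to 1/|predicted bias| for scheme (i) and to 1/(|predicted bias| x SE) for scheme (ii). The function name `composite_estimate`, the `eps` guard, and the input numbers are all hypothetical, and the function is not expected to reproduce the composites reported above.

```python
def composite_estimate(estimates, predicted_bias, std_errors=None, eps=1e-6):
    """Weighted average of grouped estimates.

    Assumed weighting (not confirmed by the source):
      scheme (i):  weight = 1 / |predicted bias|
      scheme (ii): weight = 1 / (|predicted bias| * SE), used when std_errors is given
    eps guards against division by a predicted bias or SE of exactly zero.
    """
    weights = []
    for i, bias in enumerate(predicted_bias):
        weight = 1.0 / max(abs(bias), eps)
        if std_errors is not None:
            weight /= max(std_errors[i], eps)
        weights.append(weight)
    total = sum(weights)
    return sum(w * b for w, b in zip(weights, estimates)) / total


# Hypothetical inputs for three grouping variables.
estimates = [0.30, 0.36, 0.40]
predicted_bias = [0.05, 0.10, 0.20]
std_errors = [0.10, 0.05, 0.05]

composite_i = composite_estimate(estimates, predicted_bias)
composite_ii = composite_estimate(estimates, predicted_bias, std_errors)
print(round(composite_i, 4), round(composite_ii, 4))
```

Because the weights are positive, either composite always lies between the smallest and largest grouped estimate. A grouping variable with a predicted bias near zero dominates the average, which is one reason to restrict the composite to variables already screened by category.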

Estimating the Correlation between ACH and SRAA From Grouped Data. The
zero-order correlation between SRAA and ACH, $\rho_{YX}$, can be
estimated from the grouped observations by employing the procedures
prescribed for estimating $\beta_{YX}$ from grouped data. The only
modification is that the investigator standardizes his observations
before aggregating them. Once this is done, the coefficient from the
regression of ZSRAA on ZACH at the individual level is an unbiased
estimate of the correlation coefficient between SRAA and ACH; that is,
$E(b^*_{YX}) = \beta^*_{YX} = \rho_{YX}$. Thus, the ZSRAA-on-ZACH
regression using the grouped observations yields estimates of
$\rho_{YX}$ from grouped data. Under these circumstances, comparisons of
the grouped estimate with $b^*_{YX}$ are checks for grouping bias in
estimating the individual-level correlation coefficient.
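
The standardize-then-aggregate procedure can be illustrated on simulated data. The sketch below is an assumption-laden example rather than a reanalysis: the data are generated at random (they are not the study's sample), `x` and `y` merely stand in for ACH and SRAA, and grouping is done on the standardized independent variable in ten equal-size groups.

```python
import numpy as np

rng = np.random.default_rng(0)           # simulated data, not the study's sample
n, n_groups = 2000, 10
x = rng.normal(size=n)                   # stands in for ACH
y = 0.5 * x + rng.normal(size=n)         # stands in for SRAA


def zscore(v):
    """Standardize to mean 0 and (population) standard deviation 1."""
    return (v - v.mean()) / v.std()


zx, zy = zscore(x), zscore(y)

# Individual level: the slope of zy on zx equals the correlation of x and y.
r_individual = np.corrcoef(x, y)[0, 1]
slope_individual = np.polyfit(zx, zy, 1)[0]

# Grouped level: sort on zx, cut into equal-size groups, regress group means.
order = np.argsort(zx)
groups = np.array_split(order, n_groups)
mean_zx = np.array([zx[g].mean() for g in groups])
mean_zy = np.array([zy[g].mean() for g in groups])
slope_grouped = np.polyfit(mean_zx, mean_zy, 1)[0]

# Observed grouping bias in the sense used in the text.
grouping_bias = slope_grouped - slope_individual
print(r_individual, slope_individual, slope_grouped, grouping_bias)
```

Because the groups here are formed on the independent variable, the grouped slope stays close to the individual-level correlation; replacing `zx` with `zy` in the `argsort` call groups on the dependent variable instead and can be used to explore the inflation discussed earlier.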

Table 7 illustrates the results of estimating the correlation between
SRAA and ACH from grouped observations. The standardization process
resulted in fewer groups for ZID2 (35) than for ID2 (100) and for ZACH2
(6) than for ACH2 (10). The increased coarseness of these two grouping
variables may account for their poorer estimation in the standardized
case relative to their accuracy in the unstandardized case. Also, HSGPA2
and REPGPA were not used in this phase of the investigation.

Most of the statements made about the precision of grouped estimates
in the unstandardized case hold for the standardized case. The grouping
variables tend to maintain the rank (in terms of observed bias and MSE)
that they received in the unstandardized case. The grouping variables
yielding the smallest bias and the smallest MSE in the standardized case
are the standardized counterparts of the best variables in the
unstandardized case. Again, every Category III and Category IV estimate
falls within one standard error of $\rho_{YX}$, while every Category I
estimate deviates by more than one standard error from $\rho_{YX}$.

Table 7. Grouping bias in estimating the correlation coefficient
between SRAA and ACH from the standardized regression
coefficient estimates from grouped observations.

Ungrouped estimates: $r_{YX}$ = $b^*_{YX}$ = .529, SE($b^*_{YX}$) = .0032

Grouping                            Observed   Predicted  Predicted
variable       Estimate    SE       bias       $\theta^*$ $\pi^*$     MSE

Category IV
  ZID2           .500     .1020     -.029       .017       .040      .0112
  ZID1           .442     .1942     -.087       .075       .225      .0452

Category III'
  ZACH2          .542     .1003      .013       .019   [illegible] [illegible]

Category III
  ZPARINC        .558     .1390      .029       .129       .295      .0202
  ZHSPHYS        .571     .1057      .042       .095       .433      .0130
  ZCLIMP         .717     .4863      .188      -.016       .401      .2718

Category I'
  ZSRAA2        1.832     .0615     1.303      1.395      1.507     1.6940

Category I
  ZHSMATH        .414     .0287     -.115      -.100       .307      .0140
  ZSAT2          .671     .0700      .142       .150       .210      .0251
  ZFATHED        .911     .1818      .382       .440       .874      .1790
  ZNOBOOK       1.335     .1308      .806       .800      1.285      .6668
  ZCOLEFF       1.461     .1640      .932       .765      1.194      .8917
  ZANTHIDEG     1.631     .3095     1.102      1.117      1.586     1.3102
  ZQCJOB        1.853     .4327     1.224      1.188      1.630     1.6854
  ZPARASP       1.946     .8473     1.417      1.518      2.048     2.7330

a. Predicted Bias vs. Observed Bias.

The predicted bias ($\theta^*$ when $\sigma^2$ is known and $\pi^*$ when
$\sigma^2$ is unknown) provides as good a guide for selecting grouping
variables as it did in the unstandardized regressions. $\theta^*$ is more
than .05 smaller than the observed bias for only ZCLIMP and ZCOLEFF.
$\pi^*$ is always larger than the observed bias. The underestimation by
$\theta^*$ can cause problems for the investigator if he chooses to group
by ZCLIMP, but the predicted bias for ZCOLEFF is large enough to
eliminate it from further consideration.

b. Composite Estimates of $\rho_{YX}$ from Grouped Observations.

Composite estimates of $\rho_{YX}$ were determined from the weighted
average of the estimates from the grouping variables ZID1, ZPARINC,
ZHSPHYS, ZCLIMP, and ZHSMATH. They were:

    A(i)  = .548            B(i)  = [illegible]
    A(ii) = [illegible]     B(ii) = .544

The accuracy of the composite estimates based on the products of
the expected bias and standard errors is exceeded only by grouping on
ZACH2. Moreover, as in the unstandardized case, only ZACH2, ZID2, and
ZPARINC provide estimates that are as accurate as or more accurate than
any of the composite estimates. Compositing of the best (excluding
the independent variable) estimates from grouped observations appears
to be a robust procedure that will afford greater confidence than the
estimate from observations grouped on any single characteristic besides
the independent variable.

6. Concluding Remarks

In section 5 the utility of the grouping concepts and methods
developed in section 4 was demonstrated under realistic empirical
conditions. The empirical evidence regarding the estimation of both
$\beta_{YX}$ and $\rho_{YX}$ conformed to the predictions from the
principle of incorporating the grouping characteristics as variables in
the structural model, which, in turn, led to the taxonomic categorization
of grouping variables. The latter classification resulted in clusters of
readily identifiable "good" and "bad" grouping variables under most
aggregation conditions. Furthermore, it was shown that if the
investigator formed a weighted composite of estimates from several of
his best grouping variables, his resulting estimate was invariably highly
accurate.

Effective strategies for estimating simple linear regression
coefficients and zero-order correlation coefficients have been
demonstrated for the case when data aggregation is under the
investigator's control and the grouping characteristics under
consideration have at least an ordinal scale. To a certain degree, the
results are generalizable to naturally aggregated data where some degree
of disaggregation is feasible.

The next step in the investigation of change-in-units problems
is to come to grips with the complications caused by nominal grouping
characteristics and by multiple regression. This paper concludes with
comments relevant to these two problems.

Nominal Grouping Characteristics. The development of sound procedures
for determining and predicting grouping effects when the grouping
characteristic is nominal remains the most pressing aggregation problem
in educational research. Cross-level inferences from aggregate sampling
units such as schools are too frequent to overlook and condone without
careful examination of the consequences. Unfortunately, the sociological
methods developed to date appear to be either too complicated or not
sufficiently applicable to analyses beyond the level of contingency
tables (Goodman (1959), Iversen (1973)).

The approach favored by this investigator is to try to fit
structural equation methods to this important case by in some way
incorporating the nominal grouping characteristics into the model, as was
done previously with ordered characteristics. This approach allows the
investigator to capitalize on the apparent strength of taxonomic
reasoning in the analysis of grouping effects.

To achieve the end of incorporating the nominal grouping
characteristics into the model, two approaches seem promising. Wiley
(personal communication) points out that nominal characteristics (School)
are in reality manifest representations of some latent characteristic or
characteristics (community commitment to education, community resources).
Latent structural analysis (or multiple discriminant analysis) can be
employed to estimate the ordinal true values corresponding to each group
index, and the relations between the primary variables and the latent
grouping variable can then be estimated.

A slightly less complex procedure that may also prove fruitful
is simply to represent any nominal grouping characteristic forming g
groups by g-1 dichotomous variables in the basic structural model.
The model with incorporated grouping characteristic is, then,

$$Y = \alpha + \beta_{YX} X + \gamma_{YZ(1)} Z_{(1)} + \dots + \gamma_{YZ(g-1)} Z_{(g-1)} + u$$

$$X = \delta + \gamma_{XZ(1)} Z_{(1)} + \dots + \gamma_{XZ(g-1)} Z_{(g-1)} + w$$

If $R^2_{Y \cdot X}$ is the coefficient of determination from the
regression of Y on X, then the direct strength of the relation of Y to Z
can be estimated from the square root of the increase in variation
accounted for,

$$R^2_{Y \cdot X Z(1) \dots Z(g-1)} - R^2_{Y \cdot X},$$

due to the incorporation of the dichotomous regressors based on Z. The
relations of Z to X can be estimated from

$$R_{X \cdot Z(1) \dots Z(g-1)}.$$

Neither approach yields perfect indices of the relations of a nominal
Z to X and Y, but both are worth further consideration as alternatives
to those previously proposed. At least they present future investigators
a starting point for refining the "structural equations" approach in the
nominal case.
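
The dichotomous-variable procedure can be sketched numerically. The example below is illustrative and rests on stated assumptions: the data are simulated, the group effects in `effects_x` and `effects_y` are arbitrary, and the helper `r_squared` is a name introduced here. It codes a nominal Z with g groups as g-1 dichotomous regressors, then computes the square root of the increase in variation accounted for in Y and the multiple correlation of X with the dichotomous regressors.

```python
import numpy as np


def r_squared(y, X):
    """Coefficient of determination from an OLS fit of y on X plus an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return 1.0 - resid.var() / y.var()


rng = np.random.default_rng(1)            # simulated data for illustration only
g, n = 4, 1200
z = rng.integers(0, g, size=n)            # nominal group labels 0 .. g-1
effects_x = np.array([0.0, 0.5, 1.0, 1.5])    # arbitrary group effects on X
effects_y = np.array([0.0, 0.3, -0.2, 0.6])   # arbitrary direct effects on Y
x = effects_x[z] + rng.normal(size=n)
y = 0.4 * x + effects_y[z] + rng.normal(size=n)

# g-1 dichotomous regressors; group 0 serves as the reference category.
dummies = (z[:, None] == np.arange(1, g)).astype(float)

r2_y_on_x = r_squared(y, x[:, None])
r2_y_on_x_and_z = r_squared(y, np.column_stack([x, dummies]))

# Direct strength of the Y-Z relation: square root of the R-squared increase.
strength_yz = np.sqrt(max(r2_y_on_x_and_z - r2_y_on_x, 0.0))
# Relation of Z to X: multiple correlation of X with the dichotomous regressors.
r_x_on_z = np.sqrt(r_squared(x, dummies))
print(strength_yz, r_x_on_z)
```

Only g-1 indicator columns are used because a full set of g indicators would be perfectly collinear with the intercept.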

Grouping in the Multivariate Case. The examination of the effects of
grouping in the multivariate case is a relatively new and still developing
line of investigation. For much of the 1950's and 1960's, Prais and
Aitchison's results (no bias, efficiency optimized by maximizing the
between-groups variation) defined the state of knowledge on the topic.
Investigators following up on their line of inquiry (Cramer, 196 )
focussed on strategies for optimal grouping without considering the
possibility of bias in estimation.

Haitovsky's (1966) work provided the first major break from the
optimal grouping perspective in the multiple regressor case. He studied
alternative procedures for estimating multiple regression coefficients
when the data are in the form of one-way classification tables so that
frequencies for cross-classification are lacking. Several of Haitovsky's
findings are interesting, but his most important contribution to the study
of grouping effects is his empirical evidence that grouping on one
independent variable can lead to biased estimates when the hypothesized
model contains two or more independent variables. His data suggest
that in the two-regressor case, grouping on one regressor yields good
estimates of that regressor's corresponding regression coefficient but
a distorted estimate of the coefficient of the other regressor.

Recent work by Feige and Watts (1972) is even more informative and
definitive in the multivariate case. They develop criteria for
evaluating the analytical consequences of what they call "partial
aggregation". The context for their results is the problem of performing
micro-level analyses while preserving the confidentiality of data. Feige
and Watts' (1972) investigation focusses on the differences between
grouped and ungrouped estimators of the regression parameters rather than
considering bias directly in their equation work. They attribute any
differences to one of three sources: (i) specification bias, (ii) bias
introduced by a grouping transformation that is not independent of the
disturbances, or (iii) sampling error induced by the use of less
information in the grouped regression.

The empirical example presented by Feige and Watts is illustrative
of the variety of grouping possibilities for census or survey data.
They established seven grouping rules based on demographic and financial
asset indices and then examined three levels of consolidation for each
grouping rule.

The one shortcoming of Feige and Watts from the perspective advanced
by the present investigation is their failure to state explicitly the
mechanism for selecting the "best" grouping rule when several options
are available. Their methods require knowledge of the initial
micro-model relations and thus facilitate description rather than
prediction of bias. Investigations employing taxonomic classification with
the "structural equations" approach indicate that grouping bias can be
predicted in the multivariate case. The process is complicated but not
impossible. For example, there are eight categories of grouping
variables included in the taxonomy for the two-regressor case. These
categories are generated by the direct relations of Z to the dependent
variable ($\gamma_{YZ}$) and to each of the independent variables (say,
$\gamma_{XZ}$ and $\gamma_{WZ}$). Any association between the regressors
(nonzero $\gamma_{XW}$ or $\gamma_{WX}$, depending on which is prior) can
also affect bias under certain conditions. Figure 3 presents the path
diagrams for each of the 16 subcategories from the taxonomy.

The results will not be discussed in detail. Table 8 gives some
indications of the expected results with regard to bias. Several
conclusions can be drawn from the table:

1. As long as Z has no direct relation to Y ($\gamma_{YZ} = 0$), no
grouping bias can result.

2. When Z is directly related to the causally prior regressor
(W in this example) and to Y, estimates of its regression
coefficient $\beta_{YW}$ are always biased, but unbiased estimates of
$\beta_{YX}$ are possible as long as $\gamma_{XZ} = 0$.

3. When Z is directly related to the causally posterior regressor
(X) and to Y, estimates of $\beta_{YX}$ are always biased; in this case
estimates of $\beta_{YW}$ are biased whenever either $\gamma_{WZ}$ or
$\gamma_{XW}$ is nonzero.



The taxonomic strategies presented can easily become cumbersome as
more regressors are included. More research is necessary to determine
the efficiency of this approach, especially compared to the procedures
described by Feige and Watts.

Table 8. Estimated bias from grouped observations as a function of
taxonomic category, two-regressor case.

FOOTNOTES



* The Schools of Education at Stanford University and the University
of Wisconsin-Milwaukee and the International Association for the
Evaluation of Educational Achievement partially supported this research
through release time and computer funds for data analysis. Kathe
Magayne-Roshak and Donald Haumant deserve special thanks for typing the
manuscript under horrendous conditions and time constraints.

1 Suzanne P. Wiviott has offered many helpful comments regarding the
paper. Harry Lütjohann, Lars R. Bergman, and Ingram Olkin have also made
substantial contributions to certain ideas expressed here. Two persons
influenced this work to an extent far beyond that which a simple mention
can convey: Michael T. Hannan, for his continuing interchange of ideas
regarding the problems of data aggregation and for his willingness to
collaborate with the author, thereby providing a broader forum for new
developments, and Lee J. Cronbach, who has spent an enormous amount of
time getting the author to develop his ideas more fully and carefully and
to communicate his thoughts more clearly. The errors and
misrepresentations that remain are solely the responsibility of the
author.

2 The contexts discussed in this section are in no way meant to be
exhaustive in the area of data aggregation. Temporal aggregation,
aggregation over commodities, and aggregation of different responses
within the individual have all received consideration in the literature
of economics. Econometricians have also treated data aggregation models
where the regression parameters are not constant but instead vary from
unit to unit at the micro level but are constant at the macro level
(Theil, 1964). In this investigation there is only a single regression
parameter when there is one regressor.

3 Iversen (1973) recently reviewed the methods for estimating
ecological regressions and correlations from contingency tables, but the
complexity of the approaches he suggests works against their utilization.
Also, see Hauser (1969) for a discussion of the use of contextual
variables in sociological research on individuals.



4 Oddly enough, one of the first references to the inflationary
effects of estimating correlation coefficients from grouped data was by
the eminent psychologist E. L. Thorndike (1934). There appear to be no
further comments on the topic from educational and psychological
researchers except the papers questioning the appropriateness of
estimating individual learning curves from group learning curves.

5 It is again assumed for simplicity that groups of n observations
each are formed such that gn = N. Otherwise, the group means
$(\bar{X}_j, \bar{Y}_j)$ need to be weighted by the group size $(n_j)$ in
the least-squares estimation of the parameters.

6 But the parameter of interest remains $\beta_{YX}$, the simple linear
regression coefficient.

7 This interpretation for Z is in no sense arbitrary. The process of
grouping systematically has much in common with the notion of selection.
In fact, Lütjohann (personal communication) has suggested that the
grouping bias discussed here is essentially selection bias, the result of
a manipulated sampling of the observations of X and Y because of their
association with Z. Recent work by Goldberger (1972a, 1972b), on selection
bias in evaluating treatment effects with non-random sampling, also hints
at this connection.

8 After the bulk of the analyses was completed, it was discovered that
there were missing observations on the grouping characteristics CLIMP,
COLEFF, and QCJOB. In addition, certain modifications were made in the
response categories of ANTHIDEG. In its original form, ANTHIDEG formed
nine groups. In the results reported here, however, students responding
"Other (9)" were dropped, and students anticipating any professional
degree beyond the masters level (responses 5, 6, 7, and 8) were collapsed
into a single group numbered "5." The sizes of the subsamples defined
by the acceptable responses to CLIMP, COLEFF, QCJOB, and the modified
ANTHIDEG were 2632, 2669, 2637 and 2646, respectively. An examination
of the means, standard deviations, and intercorrelations of SRAA, ACH,
and SAT for these subsamples did not indicate any consistent and
important deviations from the estimates based on the entire 2676
observations.